Reward-Respecting Subtasks for Model-Based Reinforcement Learning

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest-path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.


Introduction
In model-based reinforcement learning (MBRL), a reinforcement learning agent learns a model of the transition dynamics of its environment and then converts that model into improvements in its policy and, commonly, in its approximate value function. The conversion process, generically referred to as planning, is typically computationally expensive and may be distributed over many time steps (or even performed offline). MBRL is well-suited to environments whose transition dynamics are relatively simple but whose policy or value function is complex (such as in chess and Go). More generally, MBRL may enable dramatically faster adaptation whenever the agent is long-lived, the environment is non-stationary, and much of the environment's transition dynamics are stable (as approximately modeled by the agent).
For planning to be tractable on large problems, an AI agent's model must be abstract in state and time. Abstraction in state is important because the original states of the world may be too numerous to deal with individually, or may not be observable by the agent. In these cases, how the state should be constructed from observations is an important problem on which much work has been done with deep learning (e.g., Mnih et al., 2015). We do not address state abstraction in this paper other than by allowing the agent's state representation to be a non-Markov feature vector.
This paper concerns how we should create and work with environment models that are abstract in time. The most common way of formulating temporally-extended and temporally-variable ways of behaving in a reinforcement learning agent is as options (Sutton, Precup, & Singh, 1999), each of which comprises a way of behaving (a policy) and a way of stopping. The appeal of options is that they are in some ways interchangeable with actions. Just as we can learn models of actions' consequences and plan with those models, so we can learn and plan with models of options' effects. There remains the critical question of where the options come from. A common approach to option discovery is to pose subsidiary tasks such as reaching a bottleneck state or maximizing a sensory signal. Given a subtask, the agent can then develop temporally abstract structure for its cognition by following a progression in which the subtask is solved to produce an option, the option's consequences are learned to produce a model, and the model is used in planning. We refer to this progression (SubTask, Option, Model, Planning) as the STOMP progression for the development of temporally-abstract cognitive structure, which was first described in the original paper on options.
The conceptual innovation of the current work is to introduce the notion of a reward-respecting subtask, that is, of a subtask that optimizes the rewards of the original task until terminating in a state that is sometimes of high value. Reward-respecting subtasks contrast with commonly used subtasks, such as shortest path to bottleneck states (e.g., McGovern & Barto, 2001; Simsek & Barto, 2004), pixel maximization (Jaderberg et al., 2017), and diffusion maximization (cf. Machado, Barreto, & Precup, 2021), which maximize the cumulative sum of a signal other than the reward of the original task. For example, consider the two-room gridworld shown inset in Figure 1, with a start state in one room, a terminal goal state in the other, and a hallway state in-between. The usual four actions move the agent one cell up, down, right, or left unless blocked by a wall. A reward of +1 is received on reaching the goal state, which ends the episode. Transitions ending in the gray region between the start and hallway states produce a reward of −1 per step, while all other transitions produce a reward of zero. The discount is γ = 0.99, so the optimal path from start to goal, which travels the roundabout route avoiding the negative rewards, yields a return of 0.99^18 ≈ 0.83. The hallway state is a bottleneck and thus a natural terminating subgoal for a subtask on this problem (as in Solway et al., 2014). With a reward-respecting subtask, the agent learns a way to the hallway that maximizes the reward along the way; in this gridworld it finds the option that goes down from the start and around the negative rewards. In contrast, solving this subtask without taking the reward into consideration leads to the shortest path from start to hallway, passing through the negative rewards.
Which of these two options is more useful when their models are used in planning? Assuming the optimal options for both subtasks, and that all the models of the options and actions are known, we can contrast the progress of planning by value iteration with the primitive actions only against planning with the primitive actions augmented by the models of the two options. The results, shown in Figure 1, are much as one would expect. Planning using the shortest-path option is less efficient than planning with primitive actions only, likely because that option is not part of the final solution. Importantly, planning with the model of the option based on the reward-respecting subtask is the most efficient.
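The effect of an option model on planning can be sketched in a few lines. The following toy example (our own construction, not the paper's gridworld; all names and the chain itself are illustrative) runs value iteration on a small deterministic chain where an option's model (its cumulative reward and discounted stopping state) is simply one more backup alongside the primitive actions:

```python
import numpy as np

gamma = 0.99
n = 6                                  # states 0..5; state 5 is terminal

def step(s, a):                        # primitive action model
    s2 = min(s + 1, n - 1) if a == "right" else max(s - 1, 0)
    r = 1.0 if s2 == n - 1 and s != n - 1 else 0.0   # +1 on reaching goal
    return s2, r

def option_model(s):
    # Hand-coded model of an option that runs "right" from state 0 for
    # three steps and stops in state 3: (stopping state, reward part,
    # discount part). It is only defined where the option can start.
    return (3, 0.0, gamma ** 3) if s == 0 else None

def value_iteration(use_option, iters):
    v = np.zeros(n)                    # v[5] stays 0 (terminal)
    for _ in range(iters):
        new_v = np.zeros(n)
        for s in range(n - 1):
            backups = []
            for a in ("left", "right"):
                s2, r = step(s, a)
                backups.append(r + gamma * v[s2])
            m = option_model(s)
            if use_option and m is not None:
                s2, r, g = m
                backups.append(r + g * v[s2])   # multi-step option backup
            new_v[s] = max(backups)
        v = new_v
    return v

v_opt = value_iteration(True, 3)
v_prim = value_iteration(False, 3)
```

After only three sweeps, the start state already has a nonzero value when the option model is available, because the option backup jumps value across three steps at once, whereas with primitive actions alone value has not yet propagated back that far.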

Reward-respecting subtasks
We now define the primary task, GVF subtasks, reward-respecting subtasks, and the reward-respecting subtask used to produce Figure 1. This is the first step in the STOMP progression for developing temporally-abstract cognitive structure. This, alongside option learning, is the main focus of this extended abstract. In the longer version of this work (Sutton et al., 2022), we also discuss the model-learning and planning steps, along with further details on the results presented here.
We consider an agent interacting with an environment in a sequence of episodes, each beginning in environment state $S_0 \in \mathcal{S}$ and ending in the terminal state $S_L = \bot$ at time step $L \in \mathbb{N}$. At time steps $t < L$, the agent selects an action $A_t \in \mathcal{A}$, and the environment emits a reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and transitions to a next state $S_{t+1} \in \mathcal{S} \cup \{\bot\}$ with probability $p(s', r \mid s, a) \doteq \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}$, where $\mathcal{A}$, $\mathcal{S}$, and $\mathcal{R}$ are all finite sets. Capitalized letters denote random variables that differ from episode to episode. The agent's primary task is to find a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ that maximizes the expected discounted sum of rewards, $\mathbb{E}\!\left[R_1 + \gamma R_2 + \cdots + \gamma^{L-1} R_L\right]$, where $\gamma \in [0, 1)$ is the discount rate.
Solution methods for the primary task often involve approximating the value function $v_\pi : \mathcal{S} \to \mathbb{R}$ for the agent's current policy $\pi$, defined by $v_\pi(s) \doteq \mathbb{E}\!\left[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{L-t-1} R_L \mid S_t = s\right]$, where the expectation is conditional on the actions being selected according to $\pi$. We consider linear approximations to $v_\pi$ in this work, in which each state is converted to a feature vector by a function $x : \mathcal{S} \to \mathbb{R}^d$ (with $x(\bot) \doteq \mathbf{0}$), which might be provided by all the layers of a neural network except for the last, or by a domain expert. Then $v_\pi(s)$ is approximated as linear in $x(s)$ and a modifiable weight vector $\mathbf{w} \in \mathbb{R}^d$: $\hat{v}(x(s), \mathbf{w}) \doteq \mathbf{w}^\top x(s) = \sum_i w_i x_i(s)$, where $w_i$ and $x_i(s)$ are individual components of $\mathbf{w}$ and $x(s)$.

We now generalize the primary task to include a full range of tasks for the episodic setting, based on the framework of general value functions (GVFs; Sutton et al., 2011). Instead of maximizing the sum of rewards $R_t$, we maximize the sum of a cumulant $C_t \doteq c(S_t)$ for some function $c : \mathcal{S} \to \mathbb{R}$. Instead of the accumulation stopping only at the terminal state, it stops at every state with probability $\beta(S_t)$ for some function $\beta : \mathcal{S} \to [0, 1]$ (with $\beta(\bot) \doteq 1$). Upon stopping, there is an additional contribution to the accumulation, a final stopping value $z(S_t)$ for some function $z : \mathcal{S} \to \mathbb{R}$ (with $z(\bot) \doteq 0$). The functions $c, z$ define the task, and the policy and stopping function $\pi, \beta$, which together constitute an option, define a solution. If option $\pi, \beta$ were initiated in state $S_t$, then $A_t$ and subsequent actions would be selected according to $\pi$ until the option ended, or stopped, according to $\beta$ at step $K$. Given a GVF task $c, z$, the objective is to find an option that maximizes, for any state $S_0$ the option is started in, the GVF

$$v^{c,z}_{\pi,\beta}(s) \doteq \mathbb{E}\!\left[C_1 + \gamma C_2 + \cdots + \gamma^{K-1} C_K + \gamma^{K-1} z(S_K) \mid S_0 = s\right], \tag{1}$$

where the expectation is conditional on the actions being determined by $\pi$ and the stopping time $K$ being determined by $\beta$. Note that "stopping" is not termination, as it does not affect the actual flow of events in the trajectory.
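As a concrete reading of this objective, a single sampled GVF return accumulates the cumulant until a stop is sampled from β, then adds the stopping value at the same discount level as the last cumulant. A minimal sketch with illustrative names (not code from the paper):

```python
import random

def gvf_return(s0, step, pi, c, z, beta, gamma, rng):
    # One Monte Carlo sample of C_1 + g*C_2 + ... + g^(K-1)*C_K
    # + g^(K-1)*z(S_K), with stopping time K sampled from beta.
    s, g, ret = s0, 1.0, 0.0
    while True:
        s = step(s, pi(s, rng))       # environment transition
        ret += g * c(s)               # accumulate cumulant C_k = c(S_k)
        if rng.random() < beta(s):    # stop with probability beta(S_k)
            return ret + g * z(s)     # stopping value, same discounting
        g *= gamma                    # one more step: discount deepens

# Deterministic check: walk right along a chain, cumulant 1 everywhere,
# stop at state 3 with stopping value 10, gamma = 0.5:
#   1 + 0.5 + 0.25 + 0.25 * 10 = 4.25
val = gvf_return(0, lambda s, a: s + 1, lambda s, rng: None,
                 lambda s: 1.0, lambda s: 10.0,
                 lambda s: 1.0 if s == 3 else 0.0, 0.5, random.Random(0))
```

Note that the stopping value z(S_K) enters with the same discount factor as the final cumulant C_K, matching the objective above.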
The primary task is a special case of a GVF task in which $C_t \doteq R_t$ and $z(S_t) \doteq -\infty$, $\forall t < L$ (so that stopping before termination is never preferred). Shortest-path subtasks are defined by $C_t \doteq -1$, with $z(s) \doteq 0$ at subgoal states and $z(s) \doteq -\infty$ otherwise. GVF tasks include all the common subtasks in the literature, including (if we allow the state to formally include agent-internal variables) all those based on curiosity and other intrinsic motivations.
Reward-respecting subtasks are GVF tasks with $C_t \doteq R_t$ whose stopping values take into account the estimated value of the state stopped in. The stopping values cannot be equal to the estimated values, as then the subtask would approximate the primary task (exactly, in the tabular case) and solving it would add nothing new.
In this paper we consider reward-respecting subtasks of feature attainment, in which the objective is to maximize an individual feature of the state representation at stopping time while being mindful of the rewards received along the way. We use the superscript position to indicate the task number, and we assume at most one task per feature, so the feature index $i$ can also serve as a task number. The subtask for maximizing feature $i$ has the stopping-value function

$$z^i(s) \doteq \bar{w}^i x_i(s) + \sum_{j \neq i} w_j x_j(s),$$

where $\mathbf{w}$ and its $i$th component, $w_i$, are the weights for approximating the primary task, and $\bar{w}^i$ is a bonus weight for subtask $i$, provided as part of defining the subtask. Note that under the linear form for $\hat{v}$, $z^i$ does not depend on $w_i$. The quantity $\bar{w}^i x_i(s)$ is called the stopping bonus. Generally, it is only useful to construct a subtask for attaining feature $i$ if its contribution to value on the primary task, $w_i$, is sometimes high and sometimes low. If $w_i$ never varied, then its static value could be fully taken into account without planning. If $w_i$ does vary, then its bonus weight should be set to one of its higher values so that an option can be learned in preparation for the occasional times at which $w_i$ is high.
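In the linear case, the stopping value is simply the primary-task value estimate with the weight on feature i swapped for the bonus weight. A minimal sketch (illustrative names and numbers):

```python
import numpy as np

def stopping_value(x_s, w, i, wbar_i):
    # z_i(s): value every feature by the primary-task weights w,
    # except feature i, which gets the bonus weight wbar_i instead.
    v_hat = float(w @ x_s)                         # primary-task v_hat
    return v_hat - w[i] * x_s[i] + wbar_i * x_s[i]

x = np.array([1.0, 1.0, 0.0])
w = np.array([0.2, -0.5, 3.0])
z = stopping_value(x, w, i=1, wbar_i=1.0)          # 0.2 + 1.0 = 1.2

# As noted above, the result does not depend on w[i] itself:
w2 = w.copy(); w2[1] = 7.0
assert abs(stopping_value(x, w2, i=1, wbar_i=1.0) - z) < 1e-9
```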
Finally, the reward-respecting subtask used in producing Figure 1 was for the tabular feature for the hallway state, with bonus weight $\bar{w}^h = 1$, where $h$ denotes the index of the feature for the hallway state. The shortest-path subtask used $C_t \doteq -1$ and stopped upon reaching the hallway or terminal goal states.

Option learning for feature attainment
In this section we specify the off-policy learning algorithms we use to approximate the optimal value functions and optimal options. This is the second step in the STOMP progression for developing temporally-abstract cognitive structure.
Define the optimal value function $v^i_* : \mathcal{S} \to \mathbb{R}$ for the reward-respecting subtask of attaining the $i$th feature by (1) with $C_t \doteq R_t$, $z \doteq z^i$, and the $\pi$ and $\beta$ that maximize the value, and define the optimal option as that maximizing pair $\pi, \beta$. We will describe these algorithms in a somewhat unusual way that lets us cover all the cases compactly and uniformly. First we define a general TD (temporal-difference) error function $\delta : \mathbb{R}^4 \times [0, 1] \to \mathbb{R}$,

$$\delta(c, z, v, v', \beta) \doteq c + \beta z + \gamma (1 - \beta) v' - v,$$

in which the bootstrap target mixes the stopping value $z$ and the discounted continuing value $v'$ according to the stopping probability $\beta$, consistent with (1).
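A one-line sketch of this TD error, under the reading that the stopping value enters the bootstrap target at the same discount level as the cumulant, per the GVF objective (the explicit gamma argument is ours):

```python
def td_error(c, z, v, v_next, beta, gamma=0.99):
    # Bootstrap target mixes stopping value z and discounted
    # continuation v_next according to stopping probability beta.
    return c + beta * z + gamma * (1 - beta) * v_next - v
```

With beta = 0 this reduces to the familiar TD error c + γv′ − v; with beta = 1 the target is simply c + z.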
We use various TD errors to specify our learning algorithms, together with a general update procedure for learning with traces, which we call UpdateWeights&Traces (UWT):

Procedure UWT($\mathbf{w}$, $\mathbf{e}$, $\nabla$, $\alpha\delta$, $\rho$, $\gamma\lambda(1-\beta)$):
    $\mathbf{e} \leftarrow \rho(\mathbf{e} + \nabla)$
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha\delta\,\mathbf{e}$
    $\mathbf{e} \leftarrow \gamma\lambda(1-\beta)\,\mathbf{e}$

The first two arguments to UWT are a weight vector and an eligibility-trace vector. These arguments are both inputs and outputs; the same pair is expected to be provided together on every time step. The weight vector is the ultimate result of learning. The eligibility trace is a short-term memory that helps with credit assignment. The third argument is usually a gradient vector with respect to the weight vector. The fourth and sixth arguments are scalars; the names of the formal arguments are just suggestive of their use. Finally, the fifth argument is a scalar importance-sampling ratio used in off-policy learning (for on-policy learning it should be one).
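UWT transcribes directly into code. This sketch updates NumPy arrays in place, matching the requirement that the same weight and trace vectors be passed back in on every step:

```python
import numpy as np

def uwt(w, e, grad, alpha_delta, rho, gamma_lambda_stop):
    # w, e are modified in place; pass the same arrays every step.
    e[:] = rho * (e + grad)        # fold gradient into the trace
    w += alpha_delta * e           # move weights along the trace
    e *= gamma_lambda_stop         # decay trace toward the next step
```

A typical call would look like `uwt(w, e, x_t, alpha * delta, rho_t, gamma * lam * (1 - beta_next))`, with each scalar computed outside the procedure, as in the text.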
Now we show how to use these tools to learn the value functions and options for the subtasks, off-policy. Let $\mathcal{F} \subset \{0, \ldots, d\}$ be the set of features for which we have subtasks. Each subtask $i$ has a weight vector $\mathbf{w}^i$ for approximating its value function and a parameter vector $\theta^i$ for its policy. To learn off-policy we need the importance-sampling ratios $\rho^i_t \doteq \pi(A_t|S_t, \theta^i)/\mu(A_t|S_t)$, where $\mu : \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the behavior policy. Then, on each time step on which $S_t$ is nonterminal, for each subtask $i \in \mathcal{F}$ we do:

$$\delta_t \doteq \delta\big(R_{t+1},\, z^i_{t+1},\, \hat{v}(x(S_t), \mathbf{w}^i),\, \hat{v}(x(S_{t+1}), \mathbf{w}^i),\, \beta^i_{t+1}\big)$$
$$\text{UWT}\big(\mathbf{w}^i,\, \mathbf{e}^i,\, x(S_t),\, \alpha\delta_t,\, \rho^i_t,\, \gamma\lambda(1-\beta^i_{t+1})\big)$$
$$\text{UWT}\big(\theta^i,\, \mathbf{e}^i_\theta,\, \nabla_{\theta^i}\ln\pi(A_t|S_t, \theta^i),\, \alpha_\theta\delta_t,\, \rho^i_t,\, \gamma\lambda(1-\beta^i_{t+1})\big)$$

where here we use the shorthands $z^i_t \doteq z^i(S_t)$ and $\beta^i_t \doteq \beta^i(S_t)$. Under this algorithm, the learned approximate values $\hat{v}(x(s), \mathbf{w}^i) \doteq \mathbf{w}^{i\top} x(s)$ come to approximate the optimal subtask values $v^i_*(s) \doteq \max_{\pi,\beta} v^i_{\pi,\beta}(s)$, $\forall s \in \mathcal{S}$ and $i \in \mathcal{F}$, and the options $\pi(\cdot|\cdot, \theta^i), \beta^i$ come to approximate the corresponding optimal options.

To learn the options described in Figure 1, we used a behavior policy that selected all four actions with equal probability. The state-feature vectors were one-hot, producing a tabular representation with $d = 72$. The policy was of the softmax form with linear preferences, $\pi(a|s, \theta) \doteq e^{\theta^\top \phi(s,a)} / \sum_b e^{\theta^\top \phi(s,b)}$, where the state-action feature vectors $\phi(s, a) \in \mathbb{R}^d$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$, were again one-hot, or tabular ($d = 288$). We ran for 50,000 steps and averaged the results over 100 runs.
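The softmax policy with linear preferences and the per-step importance-sampling ratio can be sketched as follows (toy sizes and feature layout are ours; the behavior probability corresponds to the equiprobable behavior policy used above):

```python
import numpy as np

def softmax_policy(theta, phi_s):
    # phi_s: matrix whose rows are the state-action features phi(s, a),
    # one row per action. Returns the action probabilities pi(.|s, theta).
    prefs = phi_s @ theta
    p = np.exp(prefs - prefs.max())    # subtract max for stability
    return p / p.sum()

def rho(theta, phi_s, a, mu_a):
    # Importance-sampling ratio pi(a|s, theta) / mu(a|s).
    return softmax_policy(theta, phi_s)[a] / mu_a

# With all-zero preferences the policy is uniform over 4 actions, so
# against a uniform behavior policy (mu = 1/4) every ratio is 1.
phi_s = np.eye(4, 8)                   # 4 actions, d = 8 (toy sizes)
probs = softmax_policy(np.zeros(8), phi_s)
```

Subtracting the maximum preference before exponentiating leaves the probabilities unchanged but avoids overflow for large preferences.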

Multiple options and stochasticity
The results in Figure 1 were mostly to illustrate the idea of reward-respecting subtasks and to serve as a simple example of how to formulate them in the language of GVFs. We now present results for a larger STOMP progression with multiple subtasks in a stochastic environment, and we compare reward-respecting options to other choices. As before, we focus on planning performance here, with the other steps of the STOMP progression discussed in the longer version of this paper (Sutton et al., 2022). Specifically, we used the four-room episodic gridworld depicted in Figure 2a, with a start state in one room and a terminal goal state in another. As in the two-room gridworld, a reward of +1 is received on reaching the goal, which ends the episode; passing through the gray region produces a reward of −1 per step, while all other transitions produce a reward of zero. The discount factor is γ = 0.99. There are four actions, available in all nonterminal states: up, down, left, and right. Importantly, each action causes the agent to move in the corresponding direction with probability 2/3 and in one of the other three directions with probability 1/9 each.
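The transition noise just described can be sketched directly (illustrative code, not the authors'):

```python
import random

DIRS = ("up", "down", "left", "right")

def sample_direction(chosen, rng):
    # The chosen direction is followed with probability 2/3; otherwise
    # one of the other three directions is taken, 1/9 each (the
    # remaining 1/3 split evenly three ways).
    if rng.random() < 2 / 3:
        return chosen
    return rng.choice([d for d in DIRS if d != chosen])

# Quick frequency check with a fixed seed: about 2/3 of 9000 samples
# should follow the chosen direction.
rng = random.Random(0)
hits = sum(sample_direction("up", rng) == "up" for _ in range(9000))
```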
We defined four subtasks, each directed toward reaching one of the four hallway states, labeled H1-H4 in Figure 2a. Notice that reward-respecting options take into consideration the fact that the environment's stochasticity is not under the agent's control. The options often take the shortest path to the hallway or the goal state, but the uncontrolled stochasticity and the negative-reward region cause some options to prefer a longer (and safer) path. For example, when in the South-West room, options sometimes take the roundabout way to reach H3 and H4. Besides the shortest-path option, we consider eigenoptions (Machado et al., 2017; 2018) and the option-critic (Bacon, Harb, & Precup, 2017) as alternative option-construction methods. The option-critic parameterizes options' policies and uses option-value functions and the policy gradient theorem to learn options that maximize the return, leading to reward-respecting options; but such options were not previously used for planning. Eigenoptions are options designed for temporally-extended exploration that do not take rewards into consideration. Figure 2b shows that reward-respecting options are more appropriate than the other subtask formulations when the options are to be used in the STOMP progression. Moreover, considering multiple options in the STOMP progression does not degrade planning performance, and the stochastic nature of the environment does not impact the applicability of our approach.
(a) Reward-respecting options learned in a stochastic gridworld. Four options are learned as solutions to four subtasks seeking to attain the four hallway states (labeled H1-H4). Stochastic transitions make states close to the negative reward region (shaded gray) less desirable due to the natural uncertainty over their return.
(b) Progress of planning with different types of options. Each line is averaged over 30 runs and the shading represents the standard error.

Figure 2: Illustration of the STOMP progression in a four-rooms gridworld.

Conclusions, limitations, and future work
In this paper, we introduced reward-respecting subtasks and showed how they lead to more efficient planning because they are more likely to be part of the agent's final plan. We did so through the language of general value functions (GVFs), which allowed us to introduce a general procedure that unifies option learning, value learning, and model learning. Our formulation can also be seen as a formalization of the option discovery problem that unifies option discovery methods. We also introduced reward-respecting subtasks of feature attainment. This simple choice reduces the option discovery problem to one of deciding what features to maximize.
We have focused on presenting the problem formulation and the option-learning step of the STOMP progression in a general way, so as to be amenable to different forms of function approximation. We did the same for model learning and planning in the extended version of this work (Sutton et al., 2022). We considered only time abstraction, not state abstraction, but it is natural to ask how our approach would perform with different forms of function approximation. An interesting related question is how feature attainment interacts with representation-learning techniques that automatically construct state abstractions.