A way around the exploration-exploitation dilemma.

For all animals the decision to explore comes with a risk of getting less. For example, a foraging bee might find less nectar, or a hunting hawk less prey. This loss is often formalized as regret. It has been mathematically proven that exploring an uncertain world with a specific goal always incurs some regret. This is why exploration-exploitation can be a dilemma. Given this proof we wondered if the common advice to "focus on learning and not the goal" might have mathematical merit. So we re-imagined exploration in the dilemma as an open-ended search for any new information. We then developed a new minimal description of information value, which generalizes existing ideas like curiosity, novelty, and information gain. We use this description to model the dilemma as a competition between strategies that maximize reward and information independently. Here we prove this competition has a no-regret solution. When we study this solution in simulation, using classic bandit tasks, it outperforms standard approaches, especially when rewards are sparse.


Introduction
Decision making in the natural world often leads to a dilemma. As an example, let's imagine a bee foraging in a meadow (Figure 1A). The bee could go to the location of a flower it has been to before to gather nectar. Or the bee could go somewhere new, and explore. Exploration, though, comes with the risk of getting less nectar. Perfectly optimizing away this risk is a mathematically intractable problem; there is no way to explore without enduring some regret (1-4), and so the decision can become a dilemma.
Resource gathering is not the only reason animals explore. Many animals, like our bee, explore out of curiosity (Figure 1B). This exploration lets them learn about their environment, developing an often simplified model that helps them in planning actions and making future decisions (5,6). Borrowing from the field of artificial intelligence, we refer to these models as world models (7-9). World models offer a principled explanation for why animals are intrinsically curious (10-15), and prone to explore even when no rewards are present or expected (16).
Curiosity raises the question of whether animals need to explore looking for specific goals or rewards at all. Perhaps we've misinterpreted their actions, and so misconceived a fundamental problem in the learning and decision sciences.
Here we explore a bold conjecture: exploration for reward is never needed. The only exploratory behavior an animal needs is that which builds its world model.
Our contribution is threefold. We define a new minimal (axiomatic) description of information value, which generalizes existing ideas like curiosity, novelty, and information gain. In fact, the axioms let us formally disconnect information theory (17) from information value, suggesting we may have uncovered a new universal theory. Next we prove that the computer science method of dynamic programming (8,18) provides an optimal way to maximize this kind of information value. Finally, we describe a simple winner-take-all scheduling algorithm that can optimally solve a competition between strategies which independently maximize information value and reward.

Fig. 1. Two views of exploration and exploitation. A. The classic dilemma: either exploit an action with a known reward (e.g., return to the previous plant) or explore other actions on the chance they will return a better outcome (e.g., find a plant with more flowers). B. Here we offer an alternative view of the dilemma, with two different competitive goals: maximize rewards (e.g., keep returning to known flower locations) or build a world model by learning new information (e.g., the layout of the environment). Exploration here is focused on learning in general, not on reward learning specifically. Artist credit: Richard Grant.

Results
Tangible rewards are a conserved resource, but learned information isn't. For example, if a rat shares a potato chip with a cage-mate, she must necessarily split up the chip, leaving less food for herself. Whereas if a student shares the latest result from a scientific paper with a lab-mate, they do not necessarily forget a portion of that result. These differences make reward and information different concepts, and so treating information as a kind of reward is inconsistent.
If information value isn't a reward, we need another way to study and value it. To do this we first looked to the field of information theory (17), but the problem of information value is not rooted in the statistical problem of transmitting symbols, which was Shannon's goal. It is rooted in the problem of learning and remembering them.
We have no mathematical reason to prefer any one kind of world model over any other. So we crafted a new minimal definition, designed to overlap with all of them.
We must introduce some initial notation. We assume that time is a continuous value and denote increases in time using the differential quantity dt. We can then express changes in M (our world model, defined below) as a gradient, ∇M. We also assume that observations about the environment s are real numbers sampled from a finite state space s ∈ S, whose size is N (denoted S_N). Actions are also real numbers a, drawn from a finite space A_K. Rewards R_t, when they appear, are binary (0, 1) and are provided only by the external environment.
Definition 1. We can now formally define a world model M as a finite set of real numbers, whose maximum size is L (denoted M_L). We say that every world model has a pair of functions f and g. Learning of s at time t (i.e., s_t) by M is done by the invertible encoder function f, M_{t+dt} = f(M_t, s_t) and M_t = f^{-1}(M_{t+dt}, s_t). Memories ŝ_t about s_t are recalled by the decoder function g, ŝ_t = g(M_t, s_t).
The invertibility of f, denoted f^{-1}, is a mathematical way to ensure that any observation encoded in the world model can also be forgotten. This is both an important aspect of real memory, and a critical point for our mathematical analysis.
The details of f and g define what kind of world model or memory M is. Let's consider some examples. If f adds states s_t to the memory, and g tests whether s_t is in M, then M is a model of novelty (20). If f counts states and g returns those counts, then M is a count-based heuristic (21,22). If f follows Bayes' rule and g decodes the probability of s_t, then M is a Bayesian memory (9,15,23,29,30). If the size of M is much smaller than the size of the state space S_N, then f can be seen as learning a latent or compressed representation in M (19,28,31,33-37), and g decodes a reconstruction of s (ŝ_t) or future states (ŝ_{t+dt}).
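To make the (f, g) pair concrete, the following is a minimal sketch (in Python, with illustrative names of our own) of two of the world models listed above: a count-based memory and a simple Bayesian memory with Dirichlet-style pseudo-counts. Each exposes the encoder f, its inverse, and the decoder g from Definition 1.

```python
# Minimal sketch (illustrative names): two world models M, each with an
# invertible encoder f and a decoder g, as in Definition 1.

class CountMemory:
    """Count-based heuristic: f counts states, g returns the counts."""
    def __init__(self, n_states):
        self.counts = [0] * n_states      # the memory M

    def f(self, s):                       # encode: learn observation s
        self.counts[s] += 1

    def f_inv(self, s):                   # inverse encode: forget observation s
        self.counts[s] -= 1

    def g(self, s):                       # decode: recall what is known about s
        return self.counts[s]


class BayesianMemory:
    """Dirichlet-style memory: f applies a conjugate Bayesian update,
    g decodes the probability of a state."""
    def __init__(self, n_states, prior=1.0):
        self.alpha = [prior] * n_states   # the memory M (pseudo-counts)

    def f(self, s):
        self.alpha[s] += 1.0

    def f_inv(self, s):
        self.alpha[s] -= 1.0

    def g(self, s):
        return self.alpha[s] / sum(self.alpha)
```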
A minimal description of information value. To formalize information value we use two axioms that define a real-valued function, E(s_t), that measures the value of any observation s_t given a world model M and a distance metric d.

Axiom 1 (Axiom of Change). The value of information E(s_t) depends only on the total distance M moves by making observation s_t.
This axiom does three important things. It ensures information value depends only on the world model, that value is a distance in memory, and that value learning has the Markov property (8). Now, let's unpack it.
By distance we mean a function δ = d(m, m′), where m ∈ M and m′ ∈ M′ are discrete memories drawn from two memories M and M′. We define d so that d ≥ 0 for all s ∈ S, and d = 0 only if M = M′. Our definition of d does not require that the distance in memories from M to M′ be the same as from M′ to M. Nor does it require the triangle inequality to hold. For the technically inclined, this definition makes d, and so E, a premetric.
In summary, let E ≡ ||∆||. Different f and g pairs will naturally need different ways to measure distances in M. For example, in a novelty world model (20) either the Hamming or Manhattan distance is applicable and would produce binary distance values, as would a count model (21,22). A latent memory (9,15) might instead use the Euclidean norm of its own error gradient (38). A probabilistic or Bayesian memory would likely use the Kullback-Leibler (KL) divergence (23,28).
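As an illustration only, and assuming the BayesianMemory sketch above, the value of a single observation can be computed by encoding it and measuring how far the memory moved, here with the KL divergence standing in for the distance d.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions (lists summing to 1)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_value(memory, s):
    """E(s): the distance the memory moves when it learns observation s."""
    before = [memory.g(i) for i in range(len(memory.alpha))]
    memory.f(s)                                  # encode s
    after = [memory.g(i) for i in range(len(memory.alpha))]
    return kl_divergence(after, before)          # E = d(M_{t+dt}, M_t)
```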
Axiom 2 (Axiom of Equilibrium). To be valuable an observation s_t must be learnable by M.

By learnable we mean two things. First, with every (re)observation of s, M should change. Second, the change in M must eventually reach a learned equilibrium. To formalize these we constrain the average gradient of M, so that the expectation E[∇²M] ≤ 0.
Most attempts to value information rest their definition on information theory. Value might rest on the intrinsic complexity of an observation (i.e., its entropy) (39), on its similarity to the environment (i.e., mutual information) (40), or on some other salience signal (41). In our analysis, learning alone drives value. This is because learning might happen on a true world model, on a faulty world model, or even on a fictional narrative. The observation might be simple, or complex. From a subjective point of view, which is the right point of view for value, all of these are the same; value depends only on the total knowledge gained.
Exploration as a dynamic programming problem. Dynamic programming is a popular optimization method because it guarantees value is maximized using a simple algorithm that always chooses the largest option. In Theorem 1 (see Mathematical Appendix) we prove that our definition of memory has one critical property, optimal substructure, that is needed for an optimal dynamic programming solution (18,42). The other two required properties, E ≥ 0 and the Markov property (18,42), are fulfilled by Axiom 1. To write down our dynamic programming solution we introduce a little more notation. We let π denote an action policy, a function that takes a state s and returns an action a. We let δ denote the transition function, which takes a state-action pair (s_t, a_t) and returns a new state, s_{t+dt}. This function acts as an abstraction for the actual world. For notational consistency with the standard Bellman approach we also redefine E(s) as a payoff function, F(M_t, a_t) (18).
The value function for F is given in Eq. 1, and the recursive Bellman solution to learn this value function is given in Eq. 3. For the full derivation of Eq. 3 see the Mathematical Appendix, where we also prove that Eq. 3 leads to exhaustive exploration of any finite space S (Theorems 2 and 3).
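In standard Bellman form (a sketch of Eqs. 1 and 3, written in terms of the payoff F(M_t, a_t), the transition function δ, and the encoder f from Definition 1; see the Mathematical Appendix for the exact derivation):

```latex
% Sketch of the value of an exploration policy \pi_E starting from memory M_0:
V_{\pi_E}(M_0) = \sum_{t=0}^{T} F(M_t, a_t),
\quad \text{where } s_{t+dt} = \delta(s_t, a_t), \; M_{t+dt} = f(M_t, s_{t+dt}).

% Sketch of the recursive (Bellman) form, solved greedily:
V^{*}_{E}(M_t) = \max_{a_t \in A}\Big[ F(M_t, a_t) + V^{*}_{E}\big(f(M_t, \delta(s_t, a_t))\big) \Big].
```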
Scheduling a way around the dilemma. Remember that the goal of reinforcement learning is to maximize reward, an objective approximated by the value function V_R(s) and an action policy π_R.
Remember too that our overall goal is to find an algorithm that maximizes both information and reward value. To do that we imagine the policies for exploration and exploitation are possible "jobs" competing to control behavior. We know that, by definition, each of these jobs produces non-negative values: E for information or R for reinforcement learning. So our goal is to find an optimal scheduler for these two jobs.
To do this we further simplify our assumptions. We assume each action takes a constant amount of time, and has no energetic cost. We assume the policy can only take one action at a time, and that those actions are exclusive. Most scheduling solutions also assume that the value of a job is fixed, while in our problem information value changes as the world model improves. In a general setting, however, where one has no prior information about the environment, the best predictor of the next value is the last or most recent value (42,43). We assume this precept holds throughout our analysis.
With these assumptions in place, the optimal solution to this kind of scheduling problem is known to be a purely local, winner-take-all algorithm (18,42). We state this winner-take-all solution (Eq. 5) as a set of inequalities, where R_t and E_t represent the value of reward and information at the last time-point.
To ensure that the default policy is reward maximization, Eq. 5 breaks ties between R_t and E_t in favor of π_R. In stochastic environments, M can show small continual fluctuations. To allow Eq. 5 to achieve a stable solution we introduce η, a boredom threshold for exploration. Larger values of η devalue information exploration and favor exploitation of reward.
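A minimal sketch of this winner-take-all rule (Eq. 5), assuming the last observed values R_t and E_t and the boredom threshold η; ties, and any case where E_t − η does not exceed R_t, default to the reward policy π_R.

```python
def dual_value_policy(E_last, R_last, eta, pi_E, pi_R, state):
    """Winner-take-all scheduler (a sketch of Eq. 5).

    E_last, R_last : information and reward value at the last time step
    eta            : boredom threshold devaluing exploration
    pi_E, pi_R     : greedy exploration and exploitation policies
    """
    if E_last - eta > R_last:     # information wins the competition
        return pi_E(state)
    return pi_R(state)            # ties and reward wins default to pi_R
```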
The worst-case algorithmic run time for Eq. 5 is linear and additive in its policies. So if in isolation it takes T_E steps to earn E_T, and T_R steps to earn R_T, then the worst-case training time for ππ is T_E + T_R. It is worth noting that this is only true if neither policy can learn from the other's actions. There is, however, no reason that each policy cannot observe the transitions (s_t, a_t, R, s_{t+dt}) caused by the other. If this is allowed, the worst-case training time improves to max(T_E, T_R).
Exploration without regret. Suboptimal exploration strategies will lead to a loss of potential rewards, by wasting time on actions that have a lower expected value. Regret G measures the value loss caused by such exploration: G = V − V_a, where V represents the maximum value and V_a the value found by taking an exploratory action rather than an exploitative one (8).
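For concreteness, a sketch of how total regret can be tallied in the bandit simulations below (the helper is illustrative, not our exact analysis code): each trial contributes the best arm's expected value minus the expected value of the arm actually chosen.

```python
def total_regret(chosen_arms, reward_probs):
    """G accumulated over trials: the best arm's expected value minus
    the expected value of each arm actually chosen."""
    v_max = max(reward_probs)
    return sum(v_max - reward_probs[arm] for arm in chosen_arms)
```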
Optimal strategies for a solution to the exploration-exploitation dilemma should maximize total value with zero total regret. To evaluate dual value learning (Eq. 5) we compared total reward and regret across a range of both simple and challenging multi-armed bandit tasks. Despite its apparent simplicity, the essential aspects of the exploration-exploitation dilemma exist in the multi-armed bandit task (8). Here the problem to be learned is the distribution of reward probabilities across arms (Figure 2). To estimate the value of any observation s_t, we compare sequential changes in this probabilistic memory, M_{t+dt} and M_t, using the KL divergence (i.e., relative entropy; Figure 4A-B). The KL divergence is a standard way to measure the distance between two distributions (44) and is, by design, consistent with our axioms (see the Supplementary Materials for a more thorough discussion).

Table 1. Artificial agents.
E-greedy: with probability 1 − ε follow a greedy policy; with probability ε follow a random policy.
Annealed e-greedy: identical to E-greedy, but ε is decayed at a fixed rate.
Bayesian reward: use the KL divergence as a weighted intrinsic reward, sampling actions by a soft-max policy.
Random: actions are selected with a random policy (no learning).
We start with a simple experiment involving a single high-value arm; the rest of the arms have a uniform reward probability (Bandit I). This represents a trivial problem. Next we tried a basic exploration test (Bandit II), with one winning arm and one distractor arm whose value is close to, but less than, the optimal choice. We then move on to a more difficult sparse exploration problem (Bandit III), where the world has a single winning arm, but the overall probability of receiving any reward is very low (p(R) = 0.02 for the winning arm, p(R) = 0.01 for all others). Sparse reward problems are notoriously difficult to solve, and are a common feature of both the real world and artificial environments like Go, chess, and classic Atari video games (45-47). Finally, we tested a complex, large-world exploration problem (Bandit IV) with 121 arms and a complex, randomly generated reward structure. Bandits of this type and size are near the limit of human performance (48).
We compared the reward and regret performance of six artificial agents. All agents used the same temporal difference learning algorithm (TD(0); (8); see Supplementary Materials). The only difference between the agents was their exploration mechanism (Table 1). The e-greedy algorithm is a classic exploration mechanism (8). Its annealed variant is common in state-of-the-art reinforcement learning papers, like Mnih et al. (45). Other state-of-the-art exploration methods are models that treat Bayesian information gain as an intrinsic reward, where the goal of all exploration is to maximize total reward (extrinsic plus intrinsic) (9,49). To provide a lower-bound benchmark of performance we included an agent with a purely random exploration policy.
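For reference, a sketch of the two e-greedy baselines in Table 1 (parameter names are illustrative): act greedily with probability 1 − ε, randomly otherwise, with the annealed variant decaying ε at a fixed rate each trial.

```python
import random

def e_greedy_action(values, epsilon):
    """Classic e-greedy: act greedily with probability 1 - epsilon, else randomly."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda arm: values[arm])

def anneal(epsilon, decay=0.999):
    """Annealed variant: decay epsilon at a fixed rate after each trial."""
    return epsilon * decay
```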
All of the classic and state-of-the-art algorithms performed well at the different tasks in terms of accumulation of rewards (right column, Figure 3). The one exception was the sparse, low reward probability condition (Bandit III), where the dual value algorithm consistently returned more rewards than the other models. Most of the traditional models still accrued substantial amounts of regret in most of the tasks, with the exception of the annealed variant of the e-greedy algorithm during the sparse, low reward probability task (left column, Figure 3). In contrast, the dual value learning algorithm consistently maximized total reward with zero or near-zero (Bandit III) regret, as would be expected of an optimal exploration policy.

Discussion
Past work. We are certainly not the first to quantify information value (40,50), or to use that value to optimize reward learning (2,9,29,51,52). Information value, though, is typically framed as a means to maximize the amount of tangible rewards (e.g., food, water, money) accrued over time (8). This means that information is treated as an analog of these tangible or external rewards (i.e., an intrinsic reward) (9,12,23,29). This approximation does drive exploration in a practical and useful way, but it doesn't change the intractability of the dilemma (1-4).
At the other extreme from reinforcement learning are pure exploration methods, like curiosity (15,49,53) or PAC approaches (54). Curiosity learning is not generally known to converge on rewarding actions with certainty, but nevertheless it can be an effective heuristic (15,55,56). Within some bounded error, PAC learning is certain to converge (54). For example, it will find the most rewarding arm in a bandit, and do so with a bounded number of samples (57). However, the number of samples is fixed and based on the size of the environment (but see (58,59)). So while PAC will give the right answer, eventually, its exploration strategy also guarantees high regret.
Cost. It is not fair to talk about benefits without talking about costs. The worst-case run-time of a dual value algorithm is max(T_E, T_R), where T_E and T_R represent the time to learn to some criterion (see Results). In the unique setting where minimizing regret, maximizing data efficiency, exploration efficiency, and transfer do not matter, dual value learning can be a suboptimal choice.
Animal behavior. In psychology and neuroscience, curiosity and reinforcement learning have developed as separate disciplines (8,53,60). And they are separate problems, with links to different basic needs: gathering resources to maintain physiological homeostasis (61,62) and gathering information to plan for the future (8,54). Here we suggest that though they are separate problems, they are problems that can, in large part, solve one another.
The theoretical description of exploration in scientific settings is probabilistic (4,63-65). By definition probabilistic models can't make exact predictions of behavior, only statistical ones. Our approach is deterministic, and so does make exact predictions. Our theory predicts that it should be possible to guide exploration in real time using, for example, optogenetic methods in neuroscience, or well-timed stimulus manipulations in economics or other behavioral sciences.
Artificial intelligence. Progress in reinforcement learning and artificial intelligence research is limited by three factors: data efficiency, exploration efficiency, and transfer learning (19). Our algorithm speaks directly to all three of these limits. By treating exploration as a problem of building a world model, our algorithm always ensures high quality exploration. The focus on the world model also means it can be naturally integrated with data-efficient, model-based reinforcement learning (8,66). Finally, it builds a world model that is free of any task-specific bias, and so is ideal for later transfer or fine-tuning (67,68).
We describe here a simple and optimal algorithm to combine nearly any world model with any reinforcement learning algorithm. This effectively joins the two approaches to reinforcement learning, model-free and model-based, into an advantageous whole where exploration is model-based, but exploitation and reward learning are algorithmically model-free.
Everyday life.The uncertainty of the unknown can always be recast as an opportunity to learn.But rather than being a trick of positive psychology, we prove this view is (in the narrow sense of our formalism, anyway) mathematically optimal.

Dual value implementation.
Value initialization and tie breaking. The initial value E_0 for π*_E can be arbitrary, with the limit E_0 > 0. In theory E_0 does not change π*_E's long-term behavior, but different values will change the algorithm's short-term dynamics and so might be quite important in practice. By definition a pure greedy policy, like π*_E, cannot handle ties. There is simply no mathematical way to rank equal values. Theorems 2 and 3 ensure that any tie-breaking strategy is valid; however, like the choice of E_0, tie breaking can strongly affect the transient dynamics. Viable tie-breaking strategies taken from experimental work include "take the closest option", "repeat the last option", or "take the option with the highest marginal likelihood". We do suggest the tie-breaking scheme be deterministic, which maintains the determinism of the whole theory. See the Information value learning section below for concrete examples of both these choices.
The rates of exploration and exploitation. In Theorem 4 we proved that ππ inherits the optimality of the policies for both exploration, π_E, and exploitation, π_R, over infinite time. However, this proof does not say whether ππ will alter the rate of convergence of each policy. By design, it does alter the rate of each, favoring π_R. As can be seen in Eq. 5, whenever r_t = 1, π_R dominates that turn. Therefore the more likely p(r = 1), the more likely π_R will have control. This doesn't change the eventual convergence of π_E; it just delays it in direct proportion to the average rate of reward. In total, these dynamics mean that in the common case where rewards are sparse but reliable, exploration is favored and can converge more quickly. As exploration converges, so does the optimal solution to maximizing rewards.
Re-exploration. The world often changes. Or, in formal parlance, the world is a non-stationary process. When the world does change, re-exploration becomes necessary. Tuning the size of η in ππ (Eq. 5) tunes the threshold for re-exploration. That is, once π*_E has converged and so π*_R fully dominates ππ, if η is small then small changes in the world will allow π_E to exert control. If instead η is large, then large changes in the world are needed. That is, η acts as a hyper-parameter controlling how quickly rewarding behavior will dominate, and how easy it is to let exploratory behavior resurface.

Bandits.
Design. Like the slot machines which inspired them, each bandit arm returns a reward according to a predetermined probability. As an agent can only choose one bandit ("arm") at a time, it must decide whether to explore or exploit on each trial.
We study four prototypical bandits. The first has a single winning arm (p(R) = 0.8, Figure 2A); we denote it bandit I. We expect any learning agent to be able to consistently solve this task. Bandit II has two winning arms, one of which (arm 7, p(R) = 0.8) has a higher payout than the other (arm 3, p(R) = 0.6). The second arm can act as a "distractor", leading an agent to settle on this suboptimal choice. Bandit III also has a single winning arm, but the overall probability of receiving any reward is very low (p(R) = 0.02 for the winning arm, p(R) = 0.01 for all others). Sparse reward problems like these are difficult to solve and are a common feature of both the real world and artificial environments like Go, chess, and classic Atari video games (45-47). The fourth bandit (IV) has 121 arms, and a complex, randomly generated reward structure. Bandits of this type and size are probably at the limit of human performance (48).
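A sketch of these reward structures follows; only the probabilities quoted above are taken from the text, while the arm counts for Bandits I-III, the winning-arm indices for Bandits I and III, and the non-winning probabilities for Bandits I and II are illustrative assumptions.

```python
import random

# Reward probability per arm for the four bandit tasks (10 arms assumed for I-III).
bandit_I   = [0.1] * 10;  bandit_I[7]   = 0.8                    # single winning arm
bandit_II  = [0.1] * 10;  bandit_II[7]  = 0.8; bandit_II[3] = 0.6  # winner + distractor
bandit_III = [0.01] * 10; bandit_III[7] = 0.02                   # sparse rewards
bandit_IV  = [random.random() for _ in range(121)]               # large, random structure

def pull(reward_probs, arm):
    """Return a binary reward for pulling the given arm."""
    return 1 if random.random() < reward_probs[arm] else 0
```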
World model and distance. All bandits share a simple common structure. They have a set of n arms, each of which delivers rewards in a probabilistic fashion. This lends itself to a simple, discrete, n-dimensional world model, with a memory slot for each arm/dimension. Each slot then represents the independent probability of receiving a reward (Supp. Fig. 4A).
The Kullback-Leibler (KL) divergence is a widely used information theory metric, which measures the information gained by replacing one distribution with another. It is highly versatile and widely used in machine learning, Bayesian reasoning (23,29), visual neuroscience (29), experimental design (69), compression (70), and information geometry (71), to name a few examples. KL has also seen extensive use in reinforcement learning.
Itti and Baldi (29) developed an approach similar to ours for visual attention, where our information value is identical to their Bayesian surprise. Itti and Baldi (2009) showed that, compared to a range of other theoretical alternatives, information value most strongly correlates with the eye movements humans make when looking at natural images. Again in a Bayesian context, KL plays a key role in guiding active inference, a theory in which the central aim of neural systems is to make decisions which minimize free energy (14,23).
Let E represent the value of information, such that E := KL(M_{t+dt}, M_t) (Eq. 6) after observing some state s.
Axiom 1 is satisfied by limiting E calculations to successive memories. The distance requirements are naturally satisfied by KL: E = 0 if and only if M_{t+dt} = M_t, and E ≥ 0 for all pairs (M_{t+dt}, M_t).
To make Axiom 2 more concrete, in Figure 5 we show how KL changes between a hypothetical initial distribution (always shown in grey) and a "learned" distribution (colored). For simplicity's sake we use a simple discrete distribution representing a 10-armed bandit, though the illustrated patterns hold for any pair of appropriate distributions. In Figure 5C we see that KL increases substantially more for a local exchange of probability than for an even global re-normalization (compare panels A and B).
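A worked numerical illustration of this pattern (the specific numbers are ours, not the figure's): starting from a uniform 10-arm distribution, add 0.1 probability to arm 7 and take it either entirely from arm 8 (local, as in panel B) or evenly from all other arms (global, as in panel A).

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

K = 10
prior = [1.0 / K] * K                         # initial (grey) distribution

local = prior[:]                              # probability moved from one arm only
local[7] += 0.1
local[8] -= 0.1

global_ = [p - 0.1 / (K - 1) for p in prior]  # probability drawn evenly...
global_[7] = prior[7] + 0.1                   # ...from the other nine arms

print(kl(local, prior), kl(global_, prior))   # the local exchange yields the larger KL
```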
Initializing ππ. In these simulations we assume that at the start of learning an animal should have a uniform prior over the possible actions a_k ∈ A_K. Thus p(a_k) = 1/K for all a_k. We transform this uniform prior into the appropriate units for our KL-based E using the Shannon entropy, E_0 = −Σ_k p(a_k) log p(a_k).
In our simulations we use a "right next" tie-breaking heuristic which keeps track of past breaks and, in a round-robin fashion, iterates rightward over the action space.
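A sketch of both choices, assuming K arms; the entropy-based initialization follows the formula above, and the "right next" heuristic is illustrated as a deterministic round-robin over tied maxima.

```python
import math

def initial_information_value(K):
    """E0: Shannon entropy of a uniform prior over K actions (equals log K)."""
    p = 1.0 / K
    return -sum(p * math.log(p) for _ in range(K))

def right_next_tie_break(values, last_choice):
    """Deterministic 'right next' heuristic: among tied maxima, take the next
    tied arm to the right of the last choice, wrapping around the action space."""
    best = max(values)
    tied = [arm for arm, v in enumerate(values) if v == best]
    for step in range(1, len(values) + 1):
        candidate = (last_choice + step) % len(values)
        if candidate in tied:
            return candidate
    return tied[0]
```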
Reinforcement learning. Reinforcement learning in all agent models was done using the TD(0) learning rule (8) (Eq. 7), where V(s) is the value for each state (arm), R_t is the return for the current trial, and α is the learning rate (0, 1]. See the Hyperparameter optimization section for how α was chosen for each agent and bandit.

V(s) ← V(s) + α(R_t − V(s))   [7]

The return R_t differed between agents. Our dual value agent, and both variations of the e-greedy algorithm, used the reward from the environment, R_t, as the return. This value was binary. The Bayesian reward agent used a combination of information value and reward as its return, R_t + βE_t, with the weight β tuned as described below.
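A sketch of Eq. 7 as applied per trial; β and E_t for the Bayesian reward agent are supplied by the caller.

```python
def td0_update(V, arm, return_t, alpha):
    """TD(0) value update for a bandit arm (Eq. 7)."""
    V[arm] = V[arm] + alpha * (return_t - V[arm])
    return V

# Returns per agent type (sketch):
#   dual value / e-greedy agents : return_t = R_t            (binary reward)
#   Bayesian reward agent        : return_t = R_t + beta * E_t
```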
Hyperparameter optimization. The hyperparameters for each agent were tuned independently for each bandit using a modified version of Hyperband (72). For a description of the hyperparameters see Table 1, and for the values themselves see Table ??.

Information value as a dynamic programming problem. To find greedy dynamic programming (8,42) answers we must prove our memory M has optimal substructure. By optimal substructure we mean that M can be partitioned into a small number, collection, or series of memories, each of which is itself a dynamic programming solution. In general, by proving we can decompose some optimization problem into a small number of sub-problems whose optimal solutions are known, or easy to prove, it becomes trivial to prove that we can also grow the series optimally. That is, proving optimal substructure nearly automatically allows for proof by induction (42).
Theorem 1 (Optimal substructure). Assuming the transition function δ is deterministic, if V*_{π_E} is the optimal information value given by π_E, a memory M_{t+dt} has optimal substructure if the last observation s_t can be removed from M_t by the inverse encoder, M_{t−dt} = f^{-1}(M_t, s_t).

Proof. Given a known optimal value V* given by π_E, assume for the sake of contradiction that there also exists an alternative policy π′_E ≠ π_E that gives a memory M′_{t−dt} ≠ M_{t−dt} and for which V′*_{t−dt} > V*_{t−dt}. To recover the known optimal memory M_t we lift M′_{t−dt} to M′_t = f(M′_{t−dt}, s_t). This implies V′* > V*, which in turn contradicts the purported original optimality of V*, and therefore of π_E.
Bellman solution. Armed with the optimal substructure of M, we want to do the next natural thing and find a recursive Bellman solution to maximize our value function for F (Eq. 1). (A Bellman solution for F is also a solution for E (Eq. 2).) We do this in the classic way, by breaking up the series for F into an initial value F_0 and the remaining series in the summation. We can then apply this same decomposition recursively (Eq. 3) to arrive at a final "two-step" or recursive form, which is shown in Eq. 8.
A greedy policy explores exhaustively. To prevent any sort of sampling bias, we need our exploration policy π_E (Eq. 3) to visit each state s in the space S. As our policy for E is a greedy policy, proofs for exploration are really sorting problems. That is, if a state is to be visited it must have the highest value. So if every state must be visited (which is what we need to prove to avoid bias), then under a greedy policy every state's value must, at one time or another, be the maximum value. We assume implicitly here that the action policy π_E can visit all possible states in S. If for some reason π_E can only visit a subset of S, then the following proofs apply only to exploration of that subset.
To begin our proof, some notation. Let Z be the set of all visited states, where Z_0 is the empty set {} and Z is built iteratively over a path P, such that Z_{t+} = {s | s ∈ P and s ∉ Z_t}. As sorting requires ranking, we also need to formalize ranking. To do this we take an algebraic approach, and define inequality for any three real numbers (a, b, c) (Eq. 9).
Theorem 2 (State search: breadth).A greedy policy π is the only deterministic policy which ensures all states in S are visited, such that Z = S.
To complete the proof, assume that some policy π_E ≠ π*_E. By definition, policy π_E can take any action but the maximum, leaving k − 1 options. Eventually, as t → T, the only possible swap is between the max option and the kth, but we have already proven this is impossible as long as Axiom 5 holds. Therefore, the policy π_E will leave at least one option unexplored and S ≠ Z.

Theorem 3 (State search: depth). Assuming a deterministic transition function δ, a greedy policy π_E will resample S to convergence at E_t ≤ η.
Proof. Each time π*_E visits a state s, M → M′ and F(M′, a_{t+dt}) < F(M, a_t). In Theorem 2 we proved that only a deterministic greedy policy will visit each state in S over T trials.
By induction, if π*_E will visit all s ∈ S in T trials, it will revisit them in 2T; therefore as T → ∞, E → 0.

Optimality of ππ.
In the following section we prove two things about the optimality of ππ. First, if π_R and/or π_E had any optimal asymptotic property for value learning before their inclusion into our scheduler, they retain that optimal property under ππ. Second, we use this theorem to show that if both π_R and π_E are greedy, and ππ is greedy, then Eq. 5 is certain to maximize total value. This is analogous to the classic activity selection problem (42).

Independent policy convergence.
Theorem 4 (Independent policy convergence under ππ). Assuming an infinite time horizon, if π_E is optimal and π_R is optimal, then ππ is also optimal in the same senses as π_E and π_R.
Proof. The optimality of ππ can be seen by direct inspection. If p(R = 1) < 1 and we have an infinite horizon, then π_E will have an unbounded number of trials, meaning the optimality of π*_E holds. Likewise, E < η as T → ∞, ensuring π_R will dominate ππ; therefore π_R will asymptotically converge to optimal behavior.

In proving this optimality of ππ we limit the probability of a positive reward to less than one, denoted p(R_t = 1) < 1. Without this constraint the reward policy π_R would always dominate ππ when rewards are certain. While this might be useful in some circumstances, from the point of view of π_E it is extremely suboptimal; the model would never explore. Limiting p(R_t = 1) < 1 is a reasonable constraint, as rewards in the real world are rarely certain. A more naturalistic way to handle this edge case is to introduce reward satiety, or a model of physiological homeostasis (61,62).
Optimal scheduling for dual value learning problems. In classic scheduling problems the value of any job is known ahead of time (18,42). In our setting, this is not true. Reward value is generated by the environment, after taking an action. In a similar vein, information value can only be calculated after observing a new state. Yet Eq. 5 must make decisions before taking an action. If we had a perfect model of the environment, then we could predict these future values accurately with model-based control. In the general case, though, we don't know what environment to expect, let alone have a perfect model of it. As a result, we make a worst-case assumption: the environment can arbitrarily change, or bifurcate, at any time. That is, it is a highly nonlinear dynamical system (73). In such systems, myopic control, using only the most recent value to predict the next value, is known to be a robust and efficient form of control (43). We therefore assume that the last value is the best predictor of the next value, and use this assumption along with Theorem 4 to complete a trivial proof that Eq. 5 maximizes total value.

Optimal total value. If we prove ππ has optimal substructure, then, using the same replacement argument (42) as in Theorem 4, a greedy policy for ππ will maximize total value.

Theorem 5 (Total value maximization of ππ). ππ must have an optimal substructure.

Proof. Recall: reinforcement learning algorithms are embedded in Markov decision spaces, which by definition have optimal substructure.
Recall: the memory M has optimal substructure (Theorem 1).
Recall: the asymptotic behaviors of π_R and π_E are independent under ππ (Theorem 4). If both π_R and π_E have optimal substructure, and are asymptotically independent, then ππ must also have optimal substructure.

Fig. 2. Bandits. Reward probabilities for each arm in bandit tasks I-IV. Grey dots highlight the optimal (i.e., highest reward probability) arm. See main text for a complete description.

Fig. 3. Regret and total accumulated reward across models and bandit tasks. Median total regret (left column) and median total reward (right column) for simulations of each model type (N = 100 experiments per model). See main text and Table 1 for a description of each model. Error bars in all plots represent the median absolute deviation.

Fig. 4. A world model for bandits. A. Example of a single world model suitable for all bandit learning. B. Changes in the KL divergence (our choice for the distance metric during bandit learning) compared to changes in the world model, as measured by the total change in probability mass.

Fig. 5. An example of observation specificity during bandit learning. A. An initial (grey) and learned (colored) distribution, where the hypothetical observation s increases the probability of arm 7 by about 0.1, at the expense of all the other probabilities. B. Same as A, except that the decrease in probability comes only from arm 8. C. The KL divergence for local versus global learning.
Exploration and value dynamics. While agents earned nearly equivalent total reward in Bandit I (Fig. 3, top row), their exploration strategies were quite distinct. In Supp. Fig. 6B-D we compare three prototypical examples of exploration, one for each major class of agent (ours, Bayesian, and e-greedy) on Bandit I. In Supp. Fig. 6A we include an example of value learning in our agent.
