Guaranteed satisficing and finite regret: Analysis of a cognitive satisficing value function

As reinforcement learning algorithms are being applied to increasingly complicated and realistic tasks, it is becoming increasingly difficult to solve such problems within a practical time frame. Hence, we focus on a \textit{satisficing} strategy that looks for an action whose value is above the aspiration level (analogous to the break-even point), rather than the optimal action. In this paper, we introduce a simple mathematical model called risk-sensitive satisficing ($RS$) that implements a satisficing strategy by integrating risk-averse and risk-prone attitudes under the greedy policy. We apply the proposed model to the $K$-armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. The first is that $RS$ is guaranteed to find an action whose value is above the aspiration level. The second is that the regret (expected loss) of $RS$ is upper bounded by a finite value, given that the aspiration level is set to an "optimal level" so that satisficing implies optimizing. We confirm the results through numerical simulations and compare the performance of $RS$ with that of other representative algorithms for the $K$-armed bandit problems.


Introduction
Reinforcement learning (RL), a framework for learning and control in which agents search for proper actions in an environment through trial and error, has witnessed rapid development in recent years, as evidenced by the super-human performances of deep Q-networks (DQN) 1 in video game playing and AlphaGo 2 in the game of Go. Moreover, the application range of RL extends not only to more complicated tasks on computers but also to the control of robots 3 and unmanned aerial vehicles (UAVs) 4 in the real world.
As RL algorithms are being applied to increasingly complicated and realistic tasks, the limits of sensors, processors, and actuators of agents are posing serious obstacles for conventional optimization algorithms. Simon proposed the notion of bounded rationality as the principle underlying agents' behavior under resource limits 5. A bounded rational agent may appear to behave irrationally, but by considering the limits and constraints, the agent's behavior can be understood as rational. Bounded rationality has attracted considerable attention in recent years. Computational rationality 6, which has been claimed to integrate the three fields of neuroscience (brain), cognitive science (mind), and artificial intelligence (machine) 7, is an updated form of bounded rationality. Further, it has been proposed that abstraction and hierarchy, which have been considered to enable flexible and efficient cognition of humans 8, result from the above-mentioned limitations and are bounded rational 9.
The representative decision making policy in the theory of bounded rationality is satisficing 10,11. Satisficing agents do not keep searching for the optimal action; instead, they stop searching when an action whose quality is above a certain level (aspiration) is found. The satisficing strategy has not attracted much attention in reinforcement learning, except for a few studies 12,13 (to be discussed later). In previous studies 14,15, one of the authors proposed a simple satisficing value function called risk-sensitive satisficing (RS) and empirically validated its effectiveness through numerical simulations of reinforcement learning tasks.
In this paper, we apply RS to the K-armed bandit problems, which constitute the most basic class of reinforcement learning tasks, and prove two propositions. First, we prove that RS is guaranteed to find a satisfactory action: if the RS agent chooses an action in each trial and the number of trials is sufficient, the agent can stably choose an action whose value is above the aspiration level. Second, we prove the finiteness of the regret of RS. In general, the performance of algorithms in the K-armed bandit problems is measured by how small their regret (expected loss) is. It is known that the regret increases at least in the logarithmic order with the number of trials 16. Therefore, the regret grows without bound as the trials are repeated. However, we prove that if a small amount of information on the reward distributions is available so that the aspiration level is set to an "optimal level" (hence, satisficing entails optimizing), then the regret of RS is upper bounded by a finite value. We confirm these results by numerical simulations and compare the performance of RS with that of other representative algorithms for the K-armed bandit problems. Finally, we conclude the paper with a discussion on the possible applications of RS and the theoretical significance of this work.

K-armed Bandit Problems
The K-armed bandit problems that we deal with in this paper are as follows. Let there be $K$ actions $\{a_1, a_2, \ldots, a_K\}$ that lead to a reward of 1 or 0 according to the reward probabilities $\{p_1, p_2, \ldots, p_K\}$, which are unknown to the agent. If the agent chooses action $a_i$, it acquires a reward of 1 with probability $p_i$ or a reward of 0 with probability $1 - p_i$. The goal of the repeated choices is to maximize the expected accumulated reward, which is measured by minimization of the regret (the expected cumulative loss). $a_{i^*}$ denotes the action with the maximal reward probability (i.e., $p_{i^*} = \max_i p_i$). The regret when the $n$-th step (one step means one trial) ends is defined as follows.
$$\mathrm{regret}(n) = \sum_{i=1}^{K} (p_{i^*} - p_i)\, E[n_i(n)] \tag{1}$$
where $n_i(n)$ is the number of times action $a_i$ is chosen from the first to the $n$-th step (simply written as $n_i$ when the number of steps is not explicitly indicated) and $E[\,\cdot\,]$ is the expectation. Regret represents the expected loss, i.e., how inferior the cumulative expected reward from the actually chosen actions is to the cumulative expected reward obtained when the optimal action is chosen from the first step onward. The smaller the regret, the better the performance of the algorithm. The minimum value of the regret is zero, attained when the optimal action has been chosen in all the steps. It has been proven that the regret increases at least in $O(\log n)$ with the number of steps $n$ 16.
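As a minimal sketch of the regret definition above, the following Python function evaluates $\sum_i (p_{i^*} - p_i)\,E[n_i(n)]$ from a list of reward probabilities and a list of expected selection counts. The function name `regret` and the example numbers are illustrative assumptions and do not come from the paper.

```python
def regret(probs, expected_counts):
    """Regret after n steps: sum over actions of (p_best - p_i) * E[n_i(n)].

    probs           -- reward probabilities p_i (known to the analyst, not the agent)
    expected_counts -- expected number of times each action has been chosen
    """
    p_best = max(probs)
    return sum((p_best - p) * c for p, c in zip(probs, expected_counts))


# Hypothetical example: three arms and 100 steps in total.
# Only the non-optimal arms contribute: 0.2 * 15 + 0.5 * 5.
example = regret([0.7, 0.5, 0.2], [80, 15, 5])
```

Choosing the optimal arm in every step gives a regret of exactly zero, which matches the minimum value stated in the text.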
As for action selection by the agent, the basic policy is to take the action with the highest value (the greedy method). The basic valuation of action $a_i$ is based on its mean reward:
$$E_i = \frac{n_i^1}{n_i} \tag{2}$$
where $n_i^r$ is the number of times $a_i$ is chosen and the reward $r$ is acquired. $n_i$, i.e., the number of times the action $a_i$ is chosen, satisfies $n_i = n_i^1 + n_i^0$ and $n = \sum_{i=1}^{K} n_i$. Under the greedy method with the mean reward valuation, if there is a non-optimal action $a_i$ ($i \neq i^*$) that has a high value in early trials, there is a risk of $a_i$ being chosen all along. Each of the other actions must be tried an appropriate number of times so that the optimal action is found in a timely manner. Merely choosing the action with the highest value based on the accumulated knowledge (exploitation) does not suffice, and various actions must be tried (exploration). Various algorithms have been proposed to balance exploitation and exploration.

Models of Satisficing
We introduce two models of satisficing at the levels of policy and value function. The policy model follows the standard description of satisficing. The second model is the risk-sensitive value function that we analyze and test in this paper. The former is tested through simulations for comparison with the latter.

Policy Satisficing (PS) Model
A standard definition of satisficing is to keep exploring until an action whose value is above the aspiration level $R$ is found and to then stop searching and keep choosing that action (exploit). Satisficing, unlike optimization, can reduce the search cost because it does not involve searching for all actions and deciding on the optimal action. This is formulated as a policy (of reinforcement learning) as follows. If there exists at least one action whose mean reward is above the aspiration level $R$, exploitation (following the greedy method) is executed. Otherwise, when the mean reward of all the actions is below the aspiration level $R$, an action is randomly chosen. We refer to this algorithm as policy satisficing (PS).
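The PS rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `ps_choose`, the convention that an untried action has mean reward 0, and the use of a strict comparison with $R$ are assumptions made here, since the text does not fix these details.

```python
import random


def ps_choose(wins, counts, aspiration, rng=random):
    """Policy satisficing (PS): exploit if some mean reward exceeds R, else explore.

    wins[i]   -- number of times action i returned reward 1
    counts[i] -- number of times action i was chosen
    """
    # Assumption: an action that has never been tried is given mean reward 0.
    means = [w / c if c > 0 else 0.0 for w, c in zip(wins, counts)]
    if any(m > aspiration for m in means):
        # Satisfied: follow the greedy method on the mean reward.
        return max(range(len(means)), key=means.__getitem__)
    # Unsatisfied: choose an action uniformly at random.
    return rng.randrange(len(means))
```

For example, with `wins = [3, 1]`, `counts = [4, 4]`, and `aspiration = 0.5`, the first action (mean 0.75) is satisfactory and is returned; with `wins = [1, 1]` no action is satisfactory and the choice is random.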

Risk-sensitive Satisficing (RS) Value Function
One of the authors has proposed a value function called risk-sensitive satisficing (RS) that realizes satisficing action selection behavior when operated under the greedy policy 14,15 (see Supplementary Information for its relationship with other models). Before introducing the model, we first define the difference between the mean reward $E_i$ of action $a_i$ and the aspiration level $R$:
$$\delta_i = E_i - R \tag{3}$$
If there exists a positive $\delta_i$, then the agent will choose such $a_i$ and be satisfied; otherwise, it will be unsatisfied. RS is defined as follows 14:
$$RS_i = n_i \delta_i = n_i (E_i - R) \tag{4}$$
This value is used under the greedy policy: the agent chooses the action $a_i$ with the maximal $RS_i$ value.
RS integrates two risk-sensitive satisficing behaviors. When unsatisfied, RS is risk-seeking, leading to optimistic exploration. If $\delta_i < 0$ for all $i$, then actions with smaller $n_i$ are prioritized. Let $R = 0.7$ and let there be two unsatisfactory actions $a_1$ and $a_2$ with $E_1 = 0.4 < E_2 = 0.6$ and $n_1 = 7$, $n_2 = 2$. Then, $RS_1 = -2.1 < RS_2 = -0.2$; hence, $a_2$ is chosen. This preference for a less tried action can be interpreted as an optimistic expectation that the action's actual reward probability $p_i$ is above $R$. There might be some $p_i > R$ even though, thus far, $E_i < R$ for all the actions. In terms of looking for a satisfactory action, it is rational to try actions with smaller $n_i$. This accords with the motto "optimism in the face of uncertainty," which is considered a general and rational exploration strategy in reinforcement learning 17. The UCB model described later implements this idea 18.
When satisfied, RS is risk-averse, performing pessimistic exploitation. If there is only one $a_i$ for which $\delta_i$ is positive, the agent will keep choosing it. If there are multiple actions with positive $\delta_i$, then the actions with larger $n_i$ are prioritized. Let $R = 0.3$, and let there be two satisfactory actions $a_1$ and $a_2$ with $E_1 = 0.4 < E_2 = 0.6$ and $n_1 = 7$, $n_2 = 2$, as in the example above. Then, $RS_1 = 0.7 > RS_2 = 0.6$; hence, $a_1$ is chosen. In this case, a more tried action is preferred. This can be interpreted as a pessimistic expectation that the action's actual reward probability $p_i$ is below $R$. It is possible that $a_i$ is a spuriously satisfactory action with $E_i > R$ but $p_i < R$. In terms of looking for a truly satisfactory action and avoiding spuriously satisfactory ones, it is rational to prefer actions with $E_i > R$ that have a larger $n_i$.
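The two worked examples above can be checked with a short Python sketch. It assumes the form $RS_i = n_i (E_i - R)$, which is the form consistent with the numbers quoted in the text; the function names `rs_value` and `rs_choose` are illustrative and not part of the original model description.

```python
def rs_value(mean_reward, n_i, aspiration):
    """RS value of one action: n_i * (E_i - R)."""
    return n_i * (mean_reward - aspiration)


def rs_choose(means, counts, aspiration):
    """Greedy policy on RS: index of the action with the maximal RS value."""
    values = [rs_value(m, n, aspiration) for m, n in zip(means, counts)]
    return max(range(len(values)), key=values.__getitem__)


means, counts = [0.4, 0.6], [7, 2]

# Unsatisfied case (R = 0.7): the less tried action a_2 (index 1) is preferred.
unsatisfied_choice = rs_choose(means, counts, 0.7)

# Satisfied case (R = 0.3): the more tried action a_1 (index 0) is preferred.
satisfied_choice = rs_choose(means, counts, 0.3)
```

The same pair of actions is ranked in opposite orders depending only on where $R$ lies relative to the mean rewards, which is the risk-seeking versus risk-averse switch described in the text.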

Setting of the Aspiration Level
The aspiration level $R$ defines the boundary between satisfactory and unsatisfactory, analogous to the break-even point between gain and loss or the neutral reference outcome in prospect theory 19. It can be set according to the agent's internal needs or its knowledge of the environment. As an ecological example, let the agent be an animal, and let the rewards 1 and 0 represent the presence and absence of food. If the action is to look for food at one feeding ground chosen from among multiple grounds and the agent has to obtain food around once every two days for survival, then $R$ would be 0.5 or higher.
Optimization can be viewed as a special case of satisficing. If $R$ lies between the two reward probabilities of the optimal and second-optimal actions, then satisficing above $R$ means optimizing. Let us call such $R$ "an optimal aspiration level". Let the highest reward probability be $p_{1st}$ and the second-highest one be $p_{2nd}$. $R$ can be set optimally as follows:
$$p_{2nd} < R < p_{1st} \tag{5}$$
It is known that the regret increases at least in $O(\log n)$ with the number of steps $n$ 16. This is the result of assuming that the agent has no knowledge of the reward distribution. By relaxing this assumption and allowing $R$ to be set as in Eq. 5, it will be shown that the regret is upper bounded by a finite value, as in Proposition 2 described later. Note that having an optimal aspiration level does not make a K-armed bandit problem trivial. Even if we know a point between the optimal and second-optimal actions, we do not know exactly which action is optimal. Efficient identification of such an action is not trivial. In the next section, RS will be compared in terms of its performance with other algorithms, one of which needs some similar information on the reward distribution to be optimal.
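To make the notion of an optimal aspiration level concrete, the sketch below picks the midpoint between the highest and second-highest reward probabilities (the choice used later in Proposition 2) and checks whether a given $R$ lies strictly between them. The function names are hypothetical, and the code assumes that the analyst, not the agent, has access to the true probabilities.

```python
def optimal_aspiration(probs):
    """Midpoint between the highest and second-highest reward probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    if top == second:
        raise ValueError("no aspiration level separates two tied best actions")
    return (top + second) / 2


def is_optimal_aspiration(aspiration, probs):
    """True if p_2nd < R < p_1st, so that satisficing implies optimizing."""
    top, second = sorted(probs, reverse=True)[:2]
    return second < aspiration < top


probs = [0.2, 0.9, 0.5]
R = optimal_aspiration(probs)  # midpoint of 0.9 and 0.5
```

Any other point strictly inside the interval would also pass `is_optimal_aspiration`; the midpoint is just one convenient choice.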

Data availability
All data are generated by numerical simulations and they have all been reported in the paper.

Analysis
We perform a theoretical analysis of the basic satisficing and optimizing properties of RS. First, in Proposition 1, we prove that RS can stably choose actions above the aspiration level after a sufficient number of steps. Second, in Proposition 2, we prove that the regret of RS is upper bounded when an optimal aspiration level is given and satisficing becomes optimizing.

Guarantee of Satisficing
In the proof of Proposition 1, we adopt symbols clearly indicating the step number ($s$) and the chosen action ($a_i$) as follows. Both of the following represent values after $s$ steps: the mean reward $E(a_i, s)$ and the RS value $RS(a_i, s) = n_i(s)\,(E(a_i, s) - R)$.
Proposition 1 (Theoretical Guarantee of Satisficing). Let $p_i$ be the reward probability of action $a_i$ ($i = 1, 2, \ldots, K$). Let $A_U$ be the set of actions whose reward probability is not smaller than the aspiration level $R$, and let $A_L$ be the set of actions whose reward probability is smaller than $R$:
$$A_U = \{a_i \mid p_i \geq R\}, \qquad A_L = \{a_i \mid p_i < R\},$$
where $A_U$ is supposed to be a non-empty set (the corresponding index sets are written $I_U$ and $I_L$). Then, the following holds for RS.
After a sufficient number of steps, a satisfactory action $a_i$ with $p_i > R$ will always be chosen, and this state is stable.
In other words, by letting P(A) be the probability that event A will occur, P arg max Subsequently, by N j = s arg max a RS(a, s) = a j , we denote the set of steps in which action a j is chosen.Let #N be the number of elements in set N. First, we prove two claims. Proof.
) is constant for s greater than or equal to some number.This is a contradiction; hence, we have #N i = ∞.(⇒) Suppose that i ∈ I L and #N i = ∞.By the law of large numbers, for any positive number ε, there exists some S such that we have As s → ∞, we have Proof.(Claim B) We assume that for any i ∈ I U , #N i < ∞.Then, for any i ∈ I U , RS(a i , s) is constant for any s greater than or equal to some number.Furthermore, for some j ∈ I L , we have #N j = ∞.Hence, by Claim A, we have However, the following statements contradict each other: (i) RS(a j , s) → −∞, (ii) ∀i ∈ I U , RS(a i , s) = const.for any s greater than or equal to some number.Hence, we obtain Now, the following formula holds.
Therefore, we must have P(∀i Proposition 1 (again).

P arg max
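Proof. (Proposition 1) By Claim B, we have $\exists k \in I_U$, $\#N_k = \infty$. By the law of large numbers, for any positive number $\varepsilon$, there exists some $S$ such that, with probability greater than $1 - \varepsilon$, the mean reward of $a_k$ stays above $R$ for all $s \geq S$. Hence, we have $P(\text{for sufficiently large } s,\ RS(a_k, s) > 0) > 1 - \varepsilon$. Since $\varepsilon$ is arbitrary, we obtain $P(\text{for sufficiently large } s,\ RS(a_k, s) > 0) = 1$.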
Here, we assume that there exists $i \in I_L$ such that $\#N_i = \infty$. Then, we have $RS(a_i, s) \to -\infty$ by Claim A. On the other hand, $\#N_i < \infty$ follows from $RS(a_i, s) \to -\infty$ because $RS(a_k, s) > 0$ for any sufficiently large $s$. However, $\#N_i = \infty$ and $\#N_i < \infty$ contradict each other, which means that the initial assumption must be false. Hence, for any $i \in I_L$, $P(\#N_i < \infty) = 1$ holds. Therefore, the results obtained are summarized as $\exists k \in I_U$, $P(\#N_k = \infty) = 1$ and $\forall i \in I_L$, $P(\#N_i < \infty) = 1$. Proposition 1 follows immediately from these results.

Theoretical Analysis of Regret
We prove that the regret of RS is upper bounded by a finite value when the level $R$ is set to the optimal aspiration level.
Proposition 2 (Finiteness of Regret of RS). Let the highest reward probability of all the actions be $p_1$ and the second-highest reward probability be $p_2$. Further, we set $R$ as $R = (p_1 + p_2)/2$ (an optimal aspiration level). Then, the following holds for RS: "There exists a monotonically increasing function $f(s)$ for step number $s$ such that $\mathrm{regret}(s) < f(s)$ and $\lim_{s \to \infty} f(s) = M$, where $M$ is constant. Thus, $\mathrm{regret}(s) < M$".
We conceived the following proof by referring to the papers 20-22 on the tug-of-war (TOW) dynamics model (hereinafter simply referred to as TOW). TOW is similar to RS (see Supplementary Information for the similarities and differences between RS and TOW). However, in those papers, the analysis of the finiteness of the regret of TOW was strictly limited to cases in which there are only two actions and the variances of the rewards are equal. In the case of the bandit problems with the reward following the Bernoulli distributions, equal variance implies $p_1 = p_2$ or $p_2 = 1 - p_1$. (Let $V_i = p_i (1 - p_i)$ be the variance of action $a_i$; then $V_1 = V_2$ holds only in these two cases.) Thus, equal variance is a strong assumption. Here, we generalize the proof to establish finite regret with $K$ arms ($K \geq 2$) and without assuming equal variance.
i , respectively, where holds, where X i, j = 1 or 0, indicating the reward when action a i was chosen in the j-th time. V Since (p

By Proposition 1, if the step number s is sufficiently large, then n 1 (s) → s with probability 1.Hence, V By Eq. ( 16) and the central limit theorem, ∆RS i (s) follows the normal distribution with expectation E[∆RS i (s)] and variance Here, Q(x) is the Q-function, which represents the tail distribution function of the standard normal distribution.Thus, be the probability that action a i is chosen in the (n + 1)-th step.
Then, P[s = n + 1, I = i] is given by where we set By using the Chernoff bound Q(x) ≤ (1/2) exp(−x 2 /2), we evaluate the upper bound of the regret. Therefore, This concludes the proof.

Empirical Verification
We verify the proven properties through simulations. As in Proposition 2, $R = (p_1 + p_2)/2$, where $p_1 > p_2 > p_i$ ($i \neq 1, 2$). All the results below are averaged over 1,000 simulations. As an additional performance index, we consider accuracy, which is the proportion of the simulations in which the algorithm chose the optimal action in each step. Thus, the accuracy in the $t$-th step is as follows: accuracy = (number of simulations in which action $a_1$, which has the highest reward probability $p_1$, is chosen in the $t$-th step) / (total number of simulations).
First, we test whether the difference in reward probabilities can be detected, even if the difference is small, when the optimal aspiration level is set for RS. We test it with $K = 2$, where $(p_1, p_2) = (0.51, 0.49)$ and $(0.501, 0.499)$. The result is shown in Fig. 1. The dotted line at the top in Fig. 1 (b) represents the upper bound of the regret shown by Proposition 2. We see that the accuracy nearly reaches 1 after $10^6$ steps, even if the difference is only 0.002 as in $(0.501, 0.499)$. Moreover, we see that the regret does not exceed the upper bound (Eq. (27)) calculated by Proposition 2.
Next, we conduct simulations to confirm the propositions with $K = 10$. The reward probability of each action is generated uniformly at random from $[0, 1]$. The result is shown in Fig. 2. We can see that the accuracy converges to 1 and the regret does not exceed the upper bound (Eq. (27)) calculated by Proposition 2. Here, the calculated upper bound of the regret for $K = 10$ is considerably higher than the actual regret compared with the case of $K = 2$. As we evaluate the probability of choosing action $a_i$ only by comparing $a_i$ with action $a_1$ having the highest reward probability, as shown in Eq. (22) in the proof of Proposition 2, the probability of choosing $a_i$ is increasingly overestimated as the number of actions increases.
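A single run of this kind of experiment can be sketched as follows. This is not the authors' simulation code: the function name `run_rs`, the seed, the tie-breaking rule (lowest index wins), and the toy probabilities are assumptions made for illustration. The RS value is computed as $n_i^1 - n_i R$, which equals $n_i (E_i - R)$ and is well defined (zero) for untried actions.

```python
import random


def run_rs(probs, aspiration, steps, seed=0):
    """Run one Bernoulli bandit episode with greedy selection on the RS value."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k  # n_i: number of times action i was chosen
    wins = [0] * k    # n_i^1: number of times action i returned reward 1
    for _ in range(steps):
        # RS_i = n_i * (E_i - R) = n_i^1 - n_i * R (zero for untried actions).
        values = [wins[i] - counts[i] * aspiration for i in range(k)]
        i = max(range(k), key=values.__getitem__)  # ties go to the lowest index
        reward = 1 if rng.random() < probs[i] else 0
        counts[i] += 1
        wins[i] += reward
    # Realized loss of this single run; regret is its expectation over runs.
    p_best = max(probs)
    realized_loss = sum((p_best - p) * c for p, c in zip(probs, counts))
    return counts, realized_loss


# Toy setting: two arms with R at the midpoint of their reward probabilities.
counts, realized_loss = run_rs([0.9, 0.1], aspiration=0.5, steps=2000)
```

Averaging `realized_loss` and the per-step choices over many seeds would give estimates of the regret and accuracy curves of the kind reported in the figures.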

Comparison with Other Algorithms
Here, we clarify the performance and properties of RS by comparing it with some representative algorithms for the K-armed bandit problems, namely UCB1-Tuned and $\varepsilon_n$-greedy 18.

UCB1-Tuned
Upper confidence bound (UCB) is an algorithm based on the idea that the value of relatively less tried (more uncertain) actions is potentially high, similar to RS's risk-seeking evaluation when unsatisfied 18. The regret of UCB is guaranteed to increase in the logarithmic order, which is the theoretical limit 16. We include the result of UCB1-Tuned (hereinafter referred to as UCB1T), which shows better performance compared to UCB1.

$$UCB1T(a_i) = E_i + \sqrt{\frac{\ln n}{n_i} \min\left(\frac{1}{4}, V_i(n_i)\right)}$$
Here, $V_i(n_i) = v_i + \sqrt{2 \ln n / n_i}$, and $v_i$ is the variance of the reward from choosing action $a_i$. Further, 1/4 is the upper bound of the variance of a random variable following the Bernoulli distribution. In the algorithm, the action with the highest UCB1T value is chosen (the greedy method). The first term $E_i$ of UCB1T, which is the mean reward, represents the already acquired knowledge (and its exploitation), whereas the second term, which decreases as action $a_i$ is tried more, expresses the (un-)reliability of $E_i$ (which leads to exploration). When $n_i = 0$, the second term cannot be calculated, but in the first $K$ steps, each action is chosen once so that the value of the second term for all the actions is subsequently finite.
To set the level $R$ such that satisficing implies optimization, it is necessary to have some point in the interval between the highest and second-highest reward probabilities, which is usually unknown to the agent. Thus, having such an "optimal" $R$ is a type of "cheating". However, when such information is available, it should be utilized well, and RS does so.
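The UCB1T index can be written as a small Python function. The sketch assumes the standard UCB1-Tuned form from the cited reference 18, namely $E_i + \sqrt{(\ln n / n_i)\min(1/4, V_i(n_i))}$ with $V_i(n_i) = v_i + \sqrt{2 \ln n / n_i}$; the function name and argument names are illustrative.

```python
import math


def ucb1_tuned(mean, variance, n_i, n):
    """UCB1-Tuned index of one action.

    mean     -- mean reward E_i of the action
    variance -- empirical variance v_i of its rewards
    n_i      -- number of times the action was chosen (must be >= 1)
    n        -- total number of steps so far (must be >= 1)
    """
    v = variance + math.sqrt(2 * math.log(n) / n_i)
    # 1/4 caps the variance estimate, since a Bernoulli variance is at most 1/4.
    return mean + math.sqrt(math.log(n) / n_i * min(0.25, v))


# With the same mean, a rarely tried action receives a larger exploration bonus.
rarely_tried = ucb1_tuned(0.5, 0.25, n_i=2, n=100)
often_tried = ucb1_tuned(0.5, 0.25, n_i=50, n=100)
```

Because `n_i` appears in the denominator, each action has to be tried once before the index is used, as the text notes for the first $K$ steps.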
Furthermore, there is another algorithm, namely $\varepsilon_n$-greedy 18, which requires similar information for optimal performance. In this algorithm, the probability of random action selection, $\varepsilon_n$, is gradually reduced by annealing so that the regret of $\varepsilon_n$-greedy is guaranteed to be of the logarithmic order. It starts with maximal exploration (random action selection) and then gradually shifts to more exploitation as information about the environment accumulates. In $\varepsilon_n$-greedy, there are two parameters $c$ and $d$ that are set as $c > 0$ and $0 < d < 1$. When there are $K$ arms, the stepwise decreasing sequence $\varepsilon_n \in (0, 1]$, $n = 1, 2, \ldots$, is defined as follows:
$$\varepsilon_n = \min\left\{1, \frac{cK}{d^2 n}\right\}$$
The agent chooses the action $a_i$ with the highest mean reward with probability $1 - \varepsilon_n$, and it chooses a random action with probability $\varepsilon_n$ for $n = 1, 2, \ldots$ Let $p_{1st}$ be the highest reward probability, and define $\Delta_i = p_{1st} - p_i$. Then, the parameter $d$ needs to satisfy
$$0 < d \leq \min_{i:\, \Delta_i > 0} \Delta_i.$$
Further, $\min \Delta_i = p_{1st} - p_{2nd}$ needs to be known in advance. Thus, some information about the reward probabilities is required, as in the case of RS with the optimal aspiration level. In addition, the performance of $\varepsilon_n$-greedy is sensitive to the value of the parameter $c > 0$, and it is difficult to find the optimal value of $c$ 18. On the other hand, determining the optimal aspiration level $R$ for RS may be easier. It does not require a parameter like $c$, and $(p_{1st} + p_{2nd})/2$ is sufficient. More generally, it is sufficient to obtain the interval $[p_{2nd}, p_{1st}]$ or the value of any point within the interval.
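The annealing schedule and the resulting action selection can be sketched as follows, assuming the schedule $\varepsilon_n = \min\{1, cK/(d^2 n)\}$ from the cited reference 18. The function names, the `rng` argument, and the example parameter values are hypothetical.

```python
import random


def epsilon_n(n, c, d, k):
    """Exploration probability at step n (n >= 1) for k arms."""
    return min(1.0, c * k / (d * d * n))


def eps_greedy_choose(means, n, c, d, rng=random):
    """Pick a random action with probability epsilon_n, else the greedy one."""
    k = len(means)
    if rng.random() < epsilon_n(n, c, d, k):
        return rng.randrange(k)
    return max(range(k), key=means.__getitem__)


# Hypothetical parameters: the schedule is pure exploration at first
# and decays in proportion to 1/n afterwards.
early = epsilon_n(1, c=5.0, d=0.1, k=10)
late = epsilon_n(10**6, c=5.0, d=0.1, k=10)
```

The dependence on `d` is quadratic, so underestimating the gap between the best and second-best arms slows the decay of exploration considerably; this is one way to see why the method needs information comparable to an optimal aspiration level.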

Existing Satisficing Models
Here, we introduce the existing satisficing models and briefly explain the difference between those models and RS. First, the framework that is the closest to ours is that of Bendor et al. on the heuristics of satisficing 12, which analyzes the two-armed bandit problems when the rewards are Bernoulli distributed. They mainly analyzed the limiting behavior of a policy model similar to PS. Their model is different from PS in that it has a probability parameter so that, when unsatisfied, the agent switches actions only with a certain probability (not always). Therefore, the performance of their model is lower than that of PS.
The most recent and comprehensive study was conducted by Reverdy et al. 13. They decomposed satisficing into "satisfy" and "suffice" (from which the word "satisfice" is formed) and presented general problem settings that include the standard bandit problems and algorithms with optimal order. As their algorithm is an adaptation of the standard UCB 18, the difference between RS and their algorithm is similar to the difference between RS and UCB as described above. Furthermore, their analysis is limited to the bandit problems where the reward distributions are Gaussian. In their study, they extended the concept of regret and developed an algorithm that searches for actions that exceed the aspiration level with probability $(1 - \delta)$. They proved the finiteness of the regret for their algorithm when $\delta > 0$. However, it should be noted that in their study, the definition of regret is changed. Specifically, the regret of their algorithm is calculated according to whether or not the expected reward exceeds the aspiration level with probability $(1 - \delta)$, and the definition that regards the regret occurring with probability $\delta$ as zero is adopted. If $\delta = 0$, their regret is calculated according to whether the expected reward always exceeds the aspiration level or not; therefore, it becomes the same framework as that of the ordinary bandit problems. In such cases, the regret of their algorithm increases in the logarithmic order, which is the theoretical limit, and it does not become finite. On the other hand, RS can achieve finite regret without changing the definition of regret. Therefore, the purposes and problem settings are different in our study and their study.
According to the above-mentioned discussion, it is difficult to compare our study with other satisficing algorithms for reinforcement learning proposed in previous studies because the purposes and frameworks are different. It is sufficient to compare our approach with PS and UCB1. Accordingly, the other algorithms will not be handled directly hereafter.

Performance Comparison
We compare the performance of UCB1T, PS, $\varepsilon_n$-greedy, and RS with $K = 100$ through numerical simulations. The reward probabilities are selected uniformly at random from $[0, 1]$, and the average is over 1,000 simulations. As mentioned above, it is difficult to determine the parameter $c$ of $\varepsilon_n$-greedy. In this simulation, the regret of $\varepsilon_n$-greedy in the 10,000-th step is taken as a reference. An extensive parameter sweep showed empirically that the regret of $\varepsilon_n$-greedy in the 10,000-th step is minimized at around $c = 1 \times 10^{-5}$. Hence, the results of $c = 1 \times 10^{-6}$, $1 \times 10^{-5}$, and $1 \times 10^{-4}$ are shown as comparison targets. We set $d$ as $d = p_{1st} - p_{2nd}$. As for RS and PS, we set the aspiration level $R$ to an optimal level, $R = (p_{1st} + p_{2nd})/2$, so that we can evaluate the efficiency when satisficing implies optimization.
The results are shown in Fig. 3. As for accuracy, RS approaches 1 the fastest among these algorithms. As for regret, that of PS increases rapidly because PS chooses actions at random until an action whose mean reward is above $R$ is found. The regret of RS remains small (and finitely bounded), whereas the regrets of UCB1T and $\varepsilon_n$-greedy grow in the logarithmic order. In summary, we can see that RS with the optimal aspiration level $R$ shows better performance than UCB1T, PS, and $\varepsilon_n$-greedy.

Analysis of the Expected Change in Value Functions
Here, we qualitatively consider why RS with the optimal aspiration level $R$ performs better than the other algorithms. Let us consider how the value of RS in the $n$-th step changes when action $a_i$ is chosen in the $(n + 1)$-th step. In the following RS formula, $n_i^1(n)$ is the number of times a reward of 1 is obtained in the choice of action $a_i$ from the first to the $n$-th step.
$$RS(a_i, n) = n_i(n)\left(\frac{n_i^1(n)}{n_i(n)} - R\right) = n_i^1(n) - n_i(n) R$$
In the $(n + 1)$-th step, the value of RS changes with probability $p_i$ to
$$RS(a_i, n + 1) = n_i^1(n) + 1 - (n_i(n) + 1) R,$$
whereas it otherwise changes with probability $(1 - p_i)$ to
$$RS(a_i, n + 1) = n_i^1(n) - (n_i(n) + 1) R.$$
Let $\Delta RS(a_i, n) = RS(a_i, n + 1) - RS(a_i, n)$. Then, the expected value of the change, $E[\Delta RS(a_i, n)]$, is as follows:
$$E[\Delta RS(a_i, n)] = p_i (1 - R) + (1 - p_i)(-R) = p_i - R$$
Thus, we see that the following relationships hold in any step:
$$p_i > R \implies E[\Delta RS(a_i, n)] > 0 \tag{35}$$
$$p_i < R \implies E[\Delta RS(a_i, n)] < 0 \tag{36}$$
Let $R$ be set to an optimal level. Then, relationship 35 means that once the optimal action $a_i$ is chosen, $RS(a_i)$ will keep increasing on average, and it will continue to be chosen. On the other hand, relationship 36 means that if a non-optimal action $a_j$ has the highest RS value and continues to be chosen for a while, then the value keeps decreasing on average. The value for other actions remains invariant. Therefore, at some point, an action other than $a_j$ will start to be chosen. Further, note that the RS value changes at an average rate of $p_j - R < 0$. Therefore, on average, the lower the reward probability of an action, the faster the action will stop being chosen, and another action will start being chosen.
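The constant drift of the RS value can be checked with exact rational arithmetic. The sketch below assumes $RS(a_i, n) = n_i^1 - n_i R$ (equivalently $n_i (E_i - R)$) and enumerates the two possible outcomes of the next trial; the function name and the sample numbers are illustrative.

```python
from fractions import Fraction


def expected_rs_change(wins, n_i, p, R):
    """Expected change of RS(a_i) when a_i is chosen once more.

    wins -- n_i^1, the number of rewards of 1 obtained so far
    n_i  -- the number of times a_i has been chosen so far
    p    -- the true reward probability p_i
    R    -- the aspiration level
    """
    rs_now = wins - n_i * R
    rs_if_reward_1 = (wins + 1) - (n_i + 1) * R
    rs_if_reward_0 = wins - (n_i + 1) * R
    return p * (rs_if_reward_1 - rs_now) + (1 - p) * (rs_if_reward_0 - rs_now)


p, R = Fraction(3, 5), Fraction(1, 2)
drift_early = expected_rs_change(3, 10, p, R)
drift_late = expected_rs_change(700, 1000, p, R)
```

Both calls return $p_i - R$, independent of the history `(wins, n_i)`, which is the property behind relationships 35 and 36.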
To clarify the idiosyncrasies of RS, we carry out similar analyses for other value functions. First, let us analyze the mean reward. The value function is
$$Q(a_i, n) = E_i = \frac{n_i^1(n)}{n_i(n)},$$
and when action $a_i$ is chosen, the expected change in its value is
$$E[\Delta Q(a_i, n)] = \frac{p_i - E_i}{n_i(n) + 1}, \tag{37}$$
whereas the values for other actions do not change. Further, $E[\Delta Q(a_i, n)]$ is positive if $p_i > E_i$ and negative if $p_i < E_i$, and both cases may occur regardless of the reward probability $p_i$ because $E_i$ is a variable, in contrast to the constant $R$ for RS. If action $a_i$ is chosen a sufficient number of times, $p_i \approx E_i$ holds. Then, it leads to $E[\Delta Q(a_i, n)] \approx 0$, and $Q(a_i, n)$ remains nearly unchanged. This implies that there is a possibility that a non-highest action keeps being chosen (the agent is trapped in a local optimum).
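For comparison, the same exact computation can be done for the mean reward. The sketch assumes $Q(a_i, n) = n_i^1 / n_i$ and shows that the expected change equals $(p_i - E_i)/(n_i + 1)$, whose sign depends on the current estimate rather than on a fixed level; the function name is hypothetical.

```python
from fractions import Fraction


def expected_mean_change(wins, n_i, p):
    """Expected change of the mean reward of a_i when a_i is chosen once more."""
    mean_now = Fraction(wins, n_i)
    mean_if_reward_1 = Fraction(wins + 1, n_i + 1)
    mean_if_reward_0 = Fraction(wins, n_i + 1)
    return (p * (mean_if_reward_1 - mean_now)
            + (1 - p) * (mean_if_reward_0 - mean_now))


# Same arm (p = 3/5) with an underestimated and an overestimated mean reward.
change_when_underestimated = expected_mean_change(3, 10, Fraction(3, 5))
change_when_overestimated = expected_mean_change(9, 10, Fraction(3, 5))
```

The first value is positive and the second is negative even though the arm is the same, and the change vanishes once the estimate matches $p_i$, in contrast to the constant drift of RS.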
Let us consider the simplest example where there are only two actions (with $p_1 > p_2$), and choosing the optimal action $a_1$ initially does not give much reward, leading to $E_1 < p_2$ and $E_1 < E_2$. As $n_2$ increases, $E_2$ converges to $E_2 \approx p_2$, and the relationship $E_1 < E_2$ becomes fixed because $E_1 < p_2$. This leads to $a_2$ being chosen constantly. To avoid the local optima, $\varepsilon_n$-greedy prevents a non-highest action from being continuously chosen by randomly choosing actions with probability $\varepsilon_n$. With the mean reward, unlike RS, we cannot say that the smaller the reward probability of the action chosen once, the faster on average the agent switches to another action.
Next, let us analyze UCB1, which is the simplest algorithm in the UCB family.
$$UCB1(a_i, n) = E_i + \sqrt{\frac{2 \ln n}{n_i}} \tag{38}$$
When action $a_i$ is chosen, the expected change in the UCB1 value is
$$E[\Delta UCB1(a_i, n)] = \frac{p_i - E_i}{n_i + 1} + \left(\sqrt{\frac{2 \ln (n + 1)}{n_i + 1}} - \sqrt{\frac{2 \ln n}{n_i}}\right), \tag{39}$$
whereas the expected change of non-chosen action $a_j$ is as follows:
$$E[\Delta UCB1(a_j, n)] = \sqrt{\frac{2 \ln (n + 1)}{n_j}} - \sqrt{\frac{2 \ln n}{n_j}} \tag{40}$$
In Eq. (39), the first term is the same as that in Eq. (37). In Eq. (39), the second and third terms approach zero if action $a_i$ continues to be chosen. Hence, if we consider only Eq. (39), there is a possibility that the non-highest action continues to be chosen, as with Eq. (37). However, in UCB1, the value function of non-chosen action $a_j$ also changes, as in Eq. (40). Moreover, we can see that the value of the non-chosen action increases without bound because of the second term of Eq. (38). As a result, a non-highest action does not continue to be chosen. In Eq. (39), the first term is positive if $p_i > E_i$ and negative if $p_i < E_i$, and both cases may occur regardless of the reward probability $p_i$ because $E_i$ is a variable, as it is for $Q$ above. On the other hand, the second term, enclosed in parentheses, is negative if $n \geq 3$, which results from the fact that $f(x) = (\ln x)/x$ monotonically decreases for $x > e$ ($> 2$). As a result, $E[\Delta UCB1(a_i, n)]$ may be positive or negative, regardless of the reward probability. Therefore, UCB1 does not have the property of RS whereby an action with a lower reward probability is abandoned earlier.
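The two effects discussed for UCB1 can be illustrated numerically, assuming the standard index $E_i + \sqrt{2 \ln n / n_i}$ from the cited reference 18. The function name `ucb1` and the sample counts are assumptions made for this sketch.

```python
import math


def ucb1(mean, n_i, n):
    """UCB1 index: mean reward plus the exploration bonus sqrt(2 ln n / n_i)."""
    return mean + math.sqrt(2 * math.log(n) / n_i)


# A non-chosen action (n_j fixed at 5): its index keeps growing with n.
waiting_at_100 = ucb1(0.3, 5, 100)
waiting_at_1000 = ucb1(0.3, 5, 1000)

# A chosen action: its bonus shrinks when both n_i and n increase by one.
bonus_before = ucb1(0.0, 20, 100)
bonus_after = ucb1(0.0, 21, 101)
```

The growing index of the waiting action is what eventually forces a switch in UCB1, whereas in RS the switch comes from the chosen action's own value drifting downward.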
Based on the analyses presented above, let us reconsider the form of RS, $RS_i = n_i (E_i - R)$. Starting from the most basic value function of the mean reward, $E_i$, RS is formed through two operations, $(\cdot) - R$ and $n_i (\cdot)$. If only the first operation is applied, the resulting value function $\delta_i$ works exactly as the original $E_i$ under the greedy policy. On the other hand, if only $n_i (\cdot)$ is applied, the value function is $n_i E_i = n_i^1$, and it is a special case of RS with $R = 0$ where any action is satisfactory. With $n_i E_i$, the agent will continue to choose the first action that gives a reward of 1. By applying the two operations, we acquire the property of $E[\Delta RS(a_i, n)] = p_i - R$, the constant change in the RS value, regardless of the step number $n$. Therefore, the RS value of an unsatisfactory action (with the reward probability below the aspiration level) constantly decreases on average; as a result, the action will cease to be chosen at some point. Furthermore, we can say that the smaller the reward probability of the action chosen once, the faster on average the RS agent switches to other actions. As shown above, UCB and $\varepsilon_n$-greedy have no such property. Therefore, this property is considered to be one of the reasons why the performance of RS using the optimal aspiration level is superior to that of other basic algorithms.

Discussion
In this paper, we introduced a simple model called RS that implements a satisficing strategy for the K-armed bandit problems, which constitute one of the most basic classes of reinforcement learning tasks. We proved two propositions. The first is that RS is guaranteed to find a satisfactory action, that is, one whose reward probability is above the aspiration level. The second is that the regret (expected loss) of RS is upper bounded by a finite value when an optimal aspiration level (where satisficing implies optimizing) is given. We then confirmed the results through numerical simulations and compared the performance of RS with that of other representative algorithms for the K-armed bandit problems. In addition, we analyzed the properties of RS relative to other algorithms and explained why RS takes its particular form.
Except in Proposition 1, we assumed that we can set the aspiration level R to an optimal level. As the optimal aspiration level is not always available to the agent, a future research direction would be to develop an algorithm that can learn an optimal aspiration level R online. As a preliminary result, an algorithm that exploits the properties of RS has shown performance comparable to that of Thompson sampling 23 , although it has not been theoretically guaranteed thus far 24 .
There are many other advantages of RS besides those mentioned in this paper. For example, the satisficing behavior is scalable in the sense that its performance does not depend on the scale of the problems, such as the number of actions, but rather on the proportion of satisfactory actions, unlike optimization algorithms 15 . In addition, as RS is a simple value function without assumptions such as the family of reward probability distributions, it can be applied to other reinforcement learning tasks through some straightforward generalization. In fact, it has been shown that the generalized RS can conduct autonomous and efficient searches in a robotic motion learning task in which a robot learns to perform giant swings (acrobot) 14 .
One of the computational advantages of satisficing, compared to optimization, is that it can convert an optimization problem into a decision problem. With RS, the guaranteed satisficing algorithm, and R set at a certain level, we can efficiently determine whether there is an action whose value is above R. The decision framework is especially useful when a certain level of reward, rather than the optimal level, is necessary. It also facilitates parallelization. For example, we can assign aspiration levels $R_1, R_2, \ldots, R_N$, in ascending order, to $N$ agents and make the agents execute a certain task in parallel. If the task succeeds at the level $R_i$ and fails at the level $R_{i+1}$, we can infer that the optimal solution lies somewhere in $[R_i, R_{i+1}]$, and the interval may be incrementally narrowed down. This is somewhat close to human learning for solving a task. When trying to solve a task, we usually do not proceed by random trial and error in a purely bottom-up manner. Instead, we tend to adopt a top-down constraint in our trials, such as trying to run one mile in four minutes. Guaranteed satisficing may lead to reinforcement learning methods that solve tasks somewhat similarly to humans.
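The bracketing idea can be sketched as follows. This is a minimal illustration, not the authors' procedure: `run_rs`, `satisfied`, and `bracket` are hypothetical names, the agent is assumed to use the value $n_i (E_i - R)$ under the greedy policy on Bernoulli arms, and the success criterion (the most-chosen arm's empirical mean reaches R) is an assumption made for this example. The runs are independent, so they could be distributed across parallel workers; here they are executed sequentially for simplicity.

```python
import random

def run_rs(probs, R, steps, seed=0):
    """Greedy agent on the assumed RS value n_i * (E_i - R), Bernoulli rewards."""
    rng = random.Random(seed)
    k = len(probs)
    n, s = [0] * k, [0.0] * k  # pull counts and reward sums per arm
    for _ in range(steps):
        rs = [s[i] - n[i] * R for i in range(k)]  # n_i * (E_i - R)
        i = max(range(k), key=rs.__getitem__)     # greedy; ties go to the lowest index
        s[i] += 1.0 if rng.random() < probs[i] else 0.0
        n[i] += 1
    return n, s

def satisfied(probs, R, steps=5000, seed=0):
    """Hypothetical success test: does the most-chosen arm average at least R?"""
    n, s = run_rs(probs, R, steps, seed)
    i = max(range(len(n)), key=n.__getitem__)
    return s[i] / n[i] >= R

def bracket(probs, levels):
    """Return (R_i, R_{i+1}) with success at R_i and failure at R_{i+1}, or None."""
    # One independent run per aspiration level; levels must be in ascending order.
    results = [satisfied(probs, R) for R in levels]
    for lo, hi, ok_lo, ok_hi in zip(levels, levels[1:], results, results[1:]):
        if ok_lo and not ok_hi:
            return lo, hi
    return None

# The best reward probability (0.6) should fall inside the returned interval.
print(bracket([0.3, 0.6], [0.1, 0.45, 0.8]))
```

Re-running `bracket` with a finer grid of levels inside the returned interval narrows the estimate of the best achievable reward probability.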

Figure 1. Simulations of RS with K = 2, where the reward probabilities are (0.51, 0.49) or (0.501, 0.499). (a) Plot of accuracy and (b) plot of regret. The dotted line at the top represents the upper bound of the regret calculated by Proposition 2.

Figure 2. Simulations of RS with K = 10, where each reward probability is drawn uniformly at random from [0, 1] in each simulation. (a) Plot of accuracy and (b) plot of regret. The dotted line at the top represents the upper bound of the regret calculated by Proposition 2.