Linear Thompson Sampling Revisited

We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $\widetilde{O}(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of TS. We leverage the structure of the problem to show how the regret is related to the sensitivity (i


Introduction
The multi-armed bandit (MAB) framework [Bubeck and Cesa-Bianchi, 2012] formalizes the exploration-exploitation trade-off in sequential decision-making, where a learner needs to balance between exploiting current estimates to select actions maximizing the reward and exploring actions to improve the accuracy of its estimates. Two popular approaches have been developed to trade off exploration and exploitation: the optimism in the face of uncertainty (OFU) principle (see e.g., Agrawal [1995], Auer et al. [2002]), which consists in choosing the optimal action according to upper-confidence bounds on the true values, and the Thompson sampling (TS) strategy, which randomizes actions on the basis of their uncertainty. In this paper we mostly focus on this second approach.
TS is a general heuristic for decision-making problems characterized by some unknown parameters. The first version of this Bayesian heuristic dates back to Thompson [1933], but it has been rediscovered several times and successfully applied to address the exploration-exploitation trade-off in a wide range of problems (see e.g., Strens 2000, Chapelle and Li 2011, Russo and Van Roy 2014). The basic idea is to assume a prior distribution over the unknown parameters and to use Bayes' rule to update it using the samples obtained over time. More precisely, at each time step the learner gathers information by executing the optimal action corresponding to a random parameter sampled from the current posterior distribution.
Related literature. While the Bayesian perspective of TS provides a convenient tool to derive the sampling distribution, the algorithm is still valid under a frequentist approach, i.e., when the true parameter is not a random variable but a fixed parameter. As a result, the regret of TS (i.e., the difference between the rewards of the optimal action and the rewards collected by the algorithm) has been analyzed both in the Bayesian and in the frequentist setting. In MAB, TS has been shown to achieve optimal performance in the frequentist setting (see e.g., May et al. 2012, Agrawal and Goyal 2012b, Kaufmann et al. 2012, Korda et al. 2013) and the dependency of the regret on its prior has been studied in the Bayesian case by Bubeck and Liu [2013]. In more general cases, such as the (generalized) linear bandit and reinforcement learning settings, most of the literature focused on the analysis of the Bayesian regret (see e.g., Russo and Van Roy [2014], Osband and Van Roy [2015], Russo and Van Roy [2016]). Notable exceptions are the analysis of TS in finite MDPs by Gopalan and Mannor [2015] and the study in linear contextual bandit (LB) by Agrawal and Goyal [2012b]. In this paper, we focus on LB and draw novel insights on the functioning of TS in this setting. In LB the value of an arm is obtained as the inner product between an arm feature vector $x$ and an unknown global parameter $\theta^\star$. As opposed to the OFU approach, the main technical difficulty in analyzing TS lies in controlling the deviation in performance due to the randomness of the algorithm. Agrawal and Goyal [2012b] build on the MAB line of proof (as in Agrawal and Goyal [2012a]), classifying arms as saturated and unsaturated depending on whether their standard deviation is smaller or bigger than their gap to the optimal arm. While for unsaturated arms the regret is related to their standard deviation, which decreases over time, they prove that TS has a small (but constant) probability of selecting saturated arms and thus it achieves a regret $\widetilde{O}(d^{3/2}\sqrt{T})$.
Contributions. The major contributions of this paper are: 1) Following the intuition of Agrawal and Goyal [2012b], we show that TS does not need to sample from an actual Bayesian posterior distribution and that any distribution satisfying suitable concentration and anti-concentration properties guarantees a small regret. In particular, we show that the distribution should over-sample w.r.t. the standard least-squares confidence ellipsoid by a factor $\sqrt{d}$ to guarantee a constant probability of being optimistic.
2) We provide an alternative proof of TS achieving the same result as Agrawal and Goyal [2012b]. One of our major findings is that, leveraging the properties of support functions from convex geometry, we are able to prove that the regret is related to the gradient of the objective function, which is ultimately controlled by the norm of the optimal arms associated with any optimistic parameter $\tilde\theta$. This provides a novel insight: whenever an optimistic parameter $\tilde\theta_t$ is chosen, not only is its instantaneous regret small, but the corresponding optimal arm $x_t = \arg\max_x x^T\tilde\theta_t$ represents a useful exploration step that improves the accuracy of the estimation of $\theta^\star$ over the dimensions that are relevant to reduce the regret in any subsequent non-optimistic step. This approach allows us to avoid the introduction of saturated/unsaturated arms and it illustrates why any TS-like algorithm (not necessarily Bayesian) with a constant probability of being optimistic has a bounded regret.
3) Finally, we show how our proof can be easily adapted to regularized linear optimization (with arbitrary penalty) and to the generalized linear model (GLM), for which we derive the first frequentist regret bound for TS, a direction first suggested by Agrawal and Goyal [2012b] as an avenue to explore.

Preliminaries
The setting. We consider the stochastic linear bandit model. Let $\mathcal X \subset \mathbb R^d$ be an arbitrary (finite or infinite) set of arms. When an arm $x \in \mathcal X$ is pulled, a reward is generated as $r(x) = x^T\theta^\star + \xi$, where $\theta^\star \in \mathbb R^d$ is a fixed but unknown parameter and $\xi$ is a zero-mean noise. An arm $x \in \mathcal X$ is evaluated according to its expected reward $x^T\theta^\star$ and for any $\theta \in \mathbb R^d$ we denote the optimal arm and its value by
$$x^\star(\theta) = \arg\max_{x \in \mathcal X} x^T\theta, \qquad J(\theta) = \sup_{x \in \mathcal X} x^T\theta. \tag{1}$$
Then $x^\star = x^\star(\theta^\star)$ is the optimal arm for $\theta^\star$ and $J(\theta^\star)$ is its optimal value. At each step $t$, the learner selects an arm $x_t \in \mathcal X$ based on the past observations (and possibly additional randomization), it observes the reward $r_{t+1} = x_t^T\theta^\star + \xi_{t+1}$, and it suffers a regret equal to the difference in expected reward between the optimal arm $x^\star$ and the arm $x_t$. All the information observed up to time $t$ is encoded in the filtration $\mathcal F^x_t = (\mathcal F_1, \sigma(x_1, r_2, \ldots, r_t, x_t))$, where $\mathcal F_1$ contains any prior knowledge (e.g., the bound $S$). The objective of the learner is to minimize the cumulative regret up to step $T$, i.e., $R(T) = \sum_{t=1}^T \big(x^{\star,T}\theta^\star - x_t^T\theta^\star\big)$. We introduce general assumptions on the structure of the problem and on the noise $\xi_{t+1}$.
Assumption 1 (Arm set). The arm set $\mathcal X$ is a bounded closed (and hence compact) subset of $\mathbb R^d$ such that $\|x\| \le X$ for all $x \in \mathcal X$. We also assume $X = 1$.
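Assumption 2 (Bandit parameter). There exists $S \in \mathbb R^+$ such that $\|\theta^\star\| \le S$ and $S$ is known.
Assumption 3 (Noise). The noise process $\{\xi_t\}_t$ is a martingale difference sequence given $\mathcal F^x_t$ and it is conditionally $R$-subgaussian for some constant $R \ge 0$,
$$\forall \alpha \in \mathbb R, \qquad \mathbb E\big[e^{\alpha \xi_{t+1}} \mid \mathcal F^x_t\big] \le \exp\big(\alpha^2 R^2 / 2\big). \tag{2}$$
Technical tools. Let $(x_1, \ldots, x_t) \in \mathcal X^t$ be a sequence of arms and $(r_2, \ldots, r_{t+1})$ be the corresponding rewards, then $\theta^\star$ can be estimated by regularized least-squares (RLS). For any regularization parameter $\lambda \in \mathbb R^+$, the design matrix and the RLS estimate are defined as
$$V_t = \lambda I + \sum_{s=1}^{t-1} x_s x_s^T, \qquad \hat\theta_t = V_t^{-1} \sum_{s=1}^{t-1} x_s r_{s+1}. \tag{3}$$
We recall an important concentration inequality for RLS estimates.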
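Proposition 1 (Thm. 2 in Abbasi-Yadkori et al. [2011a]). For any $\delta \in (0, 1)$, under Asm. 1, 2, and 3, for any $\mathcal F^x_t$-adapted sequence $(x_1, \ldots, x_t)$, the RLS estimator $\hat\theta_t$ is such that for any fixed $t \ge 1$, w.p. $1 - \delta$ (w.r.t. the noise $\{\xi_t\}_t$ and any source of randomization in the choice of the arms),
$$\|\hat\theta_t - \theta^\star\|_{V_t} \le \beta_t(\delta), \qquad |x^T(\hat\theta_t - \theta^\star)| \le \|x\|_{V_t^{-1}}\, \beta_t(\delta) \quad \text{for any } x \in \mathbb R^d, \tag{4}$$
where $\|x\|_V = \sqrt{x^T V x}$ and
$$\beta_t(\delta) = R \sqrt{2 \log \frac{(\lambda + t)^{d/2} \lambda^{-d/2}}{\delta}} + \sqrt{\lambda}\, S.$$
At step $t$, we define the ellipsoid $\mathcal E^{RLS}_t = \{\theta \in \mathbb R^d : \|\theta - \hat\theta_t\|_{V_t} \le \beta_t(\delta')\}$, centered in $\hat\theta_t$ with orientation defined by $V_t$ and radius $\beta_t(\delta')$, where $\delta' = \delta/(4T)$. From Eq. 4 we have that $\theta^\star \in \mathcal E^{RLS}_t$ with high probability. Finally, we report a standard result of RLS that, together with Prop. 1, shows that the prediction error on the arms $x_t$ used to construct the estimator $\hat\theta_t$ is cumulatively small.
Proposition 2. Let $\lambda \ge 1$. For any arbitrary sequence $(x_1, x_2, \ldots, x_t) \in \mathcal X^t$, let $V_{t+1}$ be the corresponding design matrix (Eq. 3), then
$$\sum_{s=1}^{t} \|x_s\|^2_{V_s^{-1}} \le 2 \log \frac{\det(V_{t+1})}{\det(\lambda I)} \le 2 d \log\Big(1 + \frac{t}{\lambda}\Big).$$
This result plays a central role in most of the proofs for linear bandit, since the regret is usually related to $\|x_s\|_{V_s^{-1}}$ and Prop. 2 is used to bound its cumulative sum. While Agrawal and Goyal [2012b] achieve this by dividing arms into saturated and unsaturated, we follow a different path that leverages the core features of the problem (structure of $J(\theta)$) and of TS (probability of being optimistic).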
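The quantities in Eq. 3 and the bound of Prop. 2 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the arm sequence, the parameter `theta_star`, the noise level, and the helper names are all illustrative assumptions, and the arms are rescaled so that Asm. 1 holds with $X = 1$.

```python
import numpy as np

def rls_design_and_estimate(arms, rewards, lam=1.0):
    """Return (V, theta_hat) with V = lam*I + sum_s x_s x_s^T and theta_hat = V^{-1} sum_s x_s r_{s+1}."""
    arms = np.asarray(arms, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    d = arms.shape[1]
    V = lam * np.eye(d) + arms.T @ arms
    theta_hat = np.linalg.solve(V, arms.T @ rewards)
    return V, theta_hat

def elliptical_potential(arms, lam=1.0):
    """Return sum_s ||x_s||^2_{V_s^{-1}}, where V_s is built from x_1..x_{s-1}, and the final matrix."""
    arms = np.asarray(arms, dtype=float)
    d = arms.shape[1]
    V = lam * np.eye(d)
    total = 0.0
    for x in arms:
        total += float(x @ np.linalg.solve(V, x))  # squared V_s^{-1}-norm of x_s
        V += np.outer(x, x)                        # V_{s+1} = V_s + x_s x_s^T
    return total, V

# Hypothetical problem instance (all values are assumptions for illustration).
rng = np.random.default_rng(0)
d, t, lam = 4, 200, 1.0
arms = rng.normal(size=(t, d))
arms /= np.maximum(1.0, np.linalg.norm(arms, axis=1, keepdims=True))  # enforce ||x|| <= 1
theta_star = np.array([0.5, -0.3, 0.2, 0.1])
rewards = arms @ theta_star + 0.1 * rng.normal(size=t)

V, theta_hat = rls_design_and_estimate(arms, rewards, lam)
potential, V_next = elliptical_potential(arms, lam)
log_det_ratio = np.linalg.slogdet(V_next)[1] - d * np.log(lam)
```

Here `potential` plays the role of the left-hand side of Prop. 2, `2 * log_det_ratio` the middle term, and `2 * d * np.log(1 + t / lam)` the right-hand side.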

Linear Thompson Sampling
Agrawal and Goyal [2012b] define TS for linear bandit as a Bayesian algorithm where a Gaussian prior over $\theta^\star$ is updated according to the observed rewards, a random sample is drawn from the posterior, and the corresponding optimal arm is selected at each step.
As hinted by Agrawal and Goyal [2012b], we show that TS can be defined as a generic randomized algorithm constructed on the RLS estimate rather than an algorithm sampling from a Bayesian posterior (see Fig. 1). At any step $t$, given the RLS estimate $\hat\theta_t$ and the design matrix $V_t$, TS samples a perturbed parameter $\tilde\theta_t$ as
$$\tilde\theta_t = \hat\theta_t + \beta_t(\delta')\, V_t^{-1/2} \eta_t,$$
where $\eta_t$ is a random sample drawn i.i.d. from a suitable multivariate distribution $\mathcal D^{TS}$, which does not need to be associated with an actual posterior over $\theta^\star$. Then the optimal arm $x_t = x^\star(\tilde\theta_t)$ is chosen, a reward $r_{t+1}$ is observed, and $V_t$ and $\hat\theta_t$ are updated according to Eq. 3. Notice that the resulting distribution on $\tilde\theta_t$ is obtained by rotating $\eta_t$ according to the design matrix $V_t$ and scaling it by $\beta_t(\delta')$. The computational complexity of TS is determined by the linear optimization problem solved when computing $x^\star(\tilde\theta_t)$ and by the sampling process from $\mathcal D^{TS}$. This is in contrast with OFUL [Abbasi-Yadkori et al., 2011a], which requires solving a bilinear optimization problem (i.e., $\arg\max_{\theta} \max_{x} x^T\theta$).
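The sampling-and-update loop described above can be sketched as follows. This is an illustrative sketch under several assumptions that are not in the text: a finite arm set, Gaussian $\eta_t$, Gaussian reward noise, and a constant `beta` used as a hypothetical stand-in for the radius $\beta_t(\delta')$.

```python
import numpy as np

def linear_ts(arms, theta_star, T, beta=1.0, lam=1.0, noise_std=0.1, seed=0):
    """Run a TS-style loop on a finite arm set and return the cumulative regret.

    `beta` is a constant placeholder for beta_t(delta'); this is an assumption
    made to keep the sketch short, not the radius used in the analysis.
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    values = arms @ theta_star            # expected reward of every arm
    best_value = values.max()
    V = lam * np.eye(d)                   # design matrix V_t
    b = np.zeros(d)                       # running sum of x_s * r_{s+1}
    regret = np.zeros(T)
    for t in range(T):
        theta_hat = np.linalg.solve(V, b)                 # RLS estimate
        eigval, eigvec = np.linalg.eigh(V)                # V is symmetric positive definite
        V_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
        eta = rng.normal(size=d)                          # eta_t ~ N(0, I_d)
        theta_tilde = theta_hat + beta * V_inv_sqrt @ eta
        i = int(np.argmax(arms @ theta_tilde))            # x_t = x*(theta_tilde)
        x = arms[i]
        r = values[i] + noise_std * rng.normal()          # observed reward r_{t+1}
        V += np.outer(x, x)
        b += r * x
        regret[t] = best_value - values[i]
    return np.cumsum(regret)

# Hypothetical instance: 20 unit-norm arms in dimension 3.
rng = np.random.default_rng(1)
arms = rng.normal(size=(20, 3))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta_star = np.array([0.6, -0.2, 0.4])
cum_regret = linear_ts(arms, theta_star, T=2000)
```

Each iteration only requires a maximization of a linear function over the arms and one draw of $\eta_t$, which is the computational point made in the paragraph above.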
The key aspect to ensure small regret is that the perturbation $\eta_t$ is distributed so that TS explores enough but not too much. This translates into the following conditions on $\mathcal D^{TS}$.
Definition 1. $\mathcal D^{TS}$ is a multivariate distribution on $\mathbb R^d$, absolutely continuous with respect to the Lebesgue measure, which satisfies the following properties:
1. (anti-concentration) there exists a strictly positive probability $p$ such that for any $u \in \mathbb R^d$ with $\|u\| = 1$,
$$\mathbb P_{\eta \sim \mathcal D^{TS}}\big(u^T\eta \ge 1\big) \ge p;$$
2. (concentration) there exist positive constants $c, c'$ such that $\forall \delta \in (0, 1)$,
$$\mathbb P_{\eta \sim \mathcal D^{TS}}\Big(\|\eta\| \le \sqrt{c\, d \log(c' d/\delta)}\Big) \ge 1 - \delta.$$
Once interpreted in the construction of $\tilde\theta_t$, the definition of $\mathcal D^{TS}$ basically requires TS to explore far enough from $\hat\theta_t$ (anti-concentration) but not too much (concentration). This implies that TS performs "useful" exploration with enough frequency (notably it performs optimistic steps), but without selecting arms with too large regret. Let $\gamma_t(\delta) = \beta_t(\delta') \sqrt{c\, d \log(c' d/\delta)}$, then we introduce the high-probability ellipsoid
$$\mathcal E^{TS}_t = \{\theta \in \mathbb R^d : \|\theta - \hat\theta_t\|_{V_t} \le \gamma_t(\delta')\}.$$
The difference between $\mathcal E^{RLS}_t$ and $\mathcal E^{TS}_t$ lies in the additional factor $\sqrt{d}$ in the definition of $\gamma_t(\delta)$ and it is crucial for both concentration and anti-concentration to hold at the same time. In Sect. 5 we prove that any distribution satisfying the conditions in Def. 1 introduces the right amount of randomness to achieve the desired regret without actually satisfying any Bayesian assumption. Def. 1 includes the Gaussian prior used by Agrawal and Goyal [2012b], but also other types of distributions such as the uniform distribution on the ball $B_d(0, \sqrt{d})$ or distributions concentrated on the boundary of $\mathcal E^{TS}_t$ (refer to App. A for exact values of $c$, $c'$, and $p$ for uniform and Gaussian distributions).
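The two conditions of Def. 1 can be probed empirically for the two examples mentioned above. The sketch below is only a Monte Carlo sanity check under assumed values of $d$, $\delta$, and the direction $u$ (all hypothetical); the constants $c = c' = 2$ for the Gaussian case are the ones quoted in App. A, and the helper names are illustrative.

```python
import math
import numpy as np

def sample_uniform_ball(n, d, radius, rng):
    """Draw n points uniformly from the d-dimensional ball of the given radius."""
    g = rng.normal(size=(n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return directions * radii

def empirical_def1(samples, u, bound):
    """Empirical frequencies of the events {u^T eta >= 1} and {||eta|| <= bound}."""
    anti = float(np.mean(samples @ u >= 1.0))
    conc = float(np.mean(np.linalg.norm(samples, axis=1) <= bound))
    return anti, conc

rng = np.random.default_rng(0)
d, n, delta = 5, 50_000, 0.1          # assumed values for illustration
u = np.ones(d) / math.sqrt(d)         # an arbitrary unit direction

# Uniform distribution on the ball of radius sqrt(d): the norm never exceeds sqrt(d).
eta_unif = sample_uniform_ball(n, d, math.sqrt(d), rng)
anti_unif, conc_unif = empirical_def1(eta_unif, u, bound=math.sqrt(d))

# Standard Gaussian, with the concentration radius sqrt(c d log(c' d / delta)), c = c' = 2.
eta_gauss = rng.normal(size=(n, d))
gauss_bound = math.sqrt(2 * d * math.log(2 * d / delta))
anti_gauss, conc_gauss = empirical_def1(eta_gauss, u, bound=gauss_bound)

# Exact Gaussian value of P(u^T eta >= 1) = P(Z >= 1), for comparison.
p_exact = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
```

In both cases the anti-concentration frequency stays bounded away from zero while the norm of $\eta$ stays within the stated radius, which is the pair of properties the regret analysis relies on.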

Sketch of the proof
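In this section we report a sketch of the proof providing a geometric intuition on the behavior of TS and how its actions (i.e., the sampled $\tilde\theta_t$ and the corresponding $x_t$) influence the regret. For the sake of illustration, we consider the unit ball $\mathcal X = \{x : \|x\| \le 1\}$, such that the optimal arm is just the projection of $\theta$ on the ball ($x^\star(\theta) = \theta/\|\theta\|$), and the optimal value is $J(\theta) = \theta^T\theta/\|\theta\| = \|\theta\|$. We start by decomposing the regret using the definition of $J(\theta)$ (recall that $J(\tilde\theta_t) = x_t^T\tilde\theta_t$) as
$$R(T) = \sum_{t=1}^T \big(J(\theta^\star) - J(\tilde\theta_t)\big) + \sum_{t=1}^T \big(x_t^T\tilde\theta_t - x_t^T\theta^\star\big) = R^{TS}(T) + R^{RLS}(T).$$

Bounding $R^{RLS}(T)$.
The decomposition
$$R^{RLS}(T) = \sum_{t=1}^T \big(x_t^T\hat\theta_t - x_t^T\theta^\star\big) + \sum_{t=1}^T \big(x_t^T\tilde\theta_t - x_t^T\hat\theta_t\big)$$
shows that both the RLS estimate $\hat\theta_t$ and the TS parameter $\tilde\theta_t$ should concentrate appropriately. Since at each step $t$, $\tilde\theta_t$ is sampled from $\mathcal D^{TS}$, the second term is kept under control by construction, while the first sum deals with the prediction error of RLS. As opposed to $R^{TS}$, this error is not related to the exploration scheme and it is small for any sequence of arms. Intuitively, this is due to the fact that the RLS estimate is the minimizer of the regularized cumulative squared error, $\hat\theta_{T+1} = \arg\min_\theta \sum_{t=1}^T |r_{t+1} - x_t^T\theta|^2 + \lambda\|\theta\|^2$, so that $x_t^T\hat\theta_{T+1}$ is an accurate prediction on the arms observed so far. The RLS minimizes the error in "hindsight" (i.e., after all rewards up to $T$) and therefore it also controls the online error $|r_{t+1} - x_t^T\hat\theta_{t+1}|^2$, since by induction
$$\sum_{t=1}^T |r_{t+1} - x_t^T\hat\theta_{t+1}|^2 \le \sum_{t=1}^T |r_{t+1} - x_t^T\hat\theta_{T+1}|^2 + \lambda\|\hat\theta_{T+1}\|^2.$$
Having a small online error also implies a small prediction error $|r_{t+1} - x_t^T\hat\theta_t|^2$. In fact, using a recursive version of Eq. 3, we have
$$\hat\theta_{t+1} = \hat\theta_t + V_{t+1}^{-1} x_t \big(r_{t+1} - x_t^T\hat\theta_t\big), \qquad\text{so that}\qquad r_{t+1} - x_t^T\hat\theta_{t+1} = \big(1 - \|x_t\|^2_{V_{t+1}^{-1}}\big)\big(r_{t+1} - x_t^T\hat\theta_t\big).$$
Since the cumulative prediction error is small, the associated regret $\sum_t |x_t^T\hat\theta_t - x_t^T\theta^\star|$ is also small. This result can be seen as an intrinsic on-policy error guarantee of RLS. Nonetheless, notice that while RLS minimizes the prediction error for any sequence of arms, this does not imply the consistency of the estimator. For instance, when the same arm $x$ is repeatedly played, the unknown parameter $\theta^\star$ is well estimated in the direction of $x$ (thus making $R^{RLS}(T)$ small) but it is poorly estimated in any other direction. This shows the need for a careful exploration strategy to recover consistency and hence a sub-linear regret.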
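The lack of consistency when a single arm is played repeatedly can be made concrete with the design matrix of Eq. 3. In this small sketch (dimension, horizon, and arm are assumed values chosen for illustration), the confidence width $\|u\|_{V^{-1}}$ shrinks along the played direction but stays at its initial level along an unplayed one.

```python
import numpy as np

d, t, lam = 3, 1000, 1.0                  # assumed values for illustration
x = np.array([1.0, 0.0, 0.0])             # the single arm pulled at every step
V = lam * np.eye(d) + t * np.outer(x, x)  # design matrix after t pulls of x
V_inv = np.linalg.inv(V)

def width(direction):
    """||direction||_{V^{-1}}: the confidence width along `direction`, up to the ellipsoid radius."""
    return float(np.sqrt(direction @ V_inv @ direction))

width_played = width(x)                            # equals 1 / sqrt(lam + t)
width_unplayed = width(np.array([0.0, 1.0, 0.0]))  # equals 1 / sqrt(lam)
```

With these values the width along $x$ is $1/\sqrt{1001}$ while the width along the orthogonal direction remains $1$, regardless of how many times $x$ is pulled.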

Bounding $R^{TS}(T)$.
We denote by $R^{TS}_t = J(\theta^\star) - J(\tilde\theta_t)$ the per-step regret. For optimistic algorithms this term is bounded by 0 at any step since w.h.p. $J(\tilde\theta_t) \ge J(\theta^\star)$ by construction. In the Bayesian regret analysis of TS, this term is equal to 0 in expectation by the assumption that $\theta^\star$ is drawn from the same prior as $\tilde\theta_t$. On the other hand, in the frequentist analysis, we have to control the deviations caused by the random sampling of $\tilde\theta_t$. This is achieved by showing that the arms selected by TS provide "useful" information about $\theta^\star$ and contribute to keep the regret small. We follow three steps: 1) we show that the regret is related to the sensitivity of $J$ w.r.t. the errors in estimating $\theta^\star$ and we bound the regret with the gradient of $J(\theta)$ at any optimistic $\theta$; 2) we show how the gradient at a point $\theta$ is intrinsically related to its corresponding optimal arm $x^\star(\theta)$; 3) since we prove that TS is frequently optimistic, we can finally link $x^\star(\theta)$ to $x_t = x^\star(\tilde\theta_t)$, and Prop. 2 allows us to bound the overall regret.
Step 1 (regret and sensitivity of $J$). We first show why the exploration of TS should be well adapted to $J(\theta)$.
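Using the definition of $J(\theta) = \|\theta\|$ we have
$$R^{TS}_t = \|\theta^\star\| - \|\tilde\theta_t\| \le \|\theta^\star - \tilde\theta_t\| \le \frac{\|\theta^\star - \tilde\theta_t\|_{V_t}}{\sqrt{\lambda_{\min,t}}},$$
where $\lambda_{\min,t}$ is the smallest eigenvalue of $V_t$. This bound shows that it is sufficient to estimate $\theta^\star$ accurately over all its components (i.e., $1/\lambda_{\min,t}$ tends to zero) to obtain a no-regret algorithm. Nonetheless, the desired regret bound of $O(\sqrt{T})$ is obtained only if $1/\lambda_{\min,t}$ decreases as $O(1/t)$. While this could be achieved by a fully explorative algorithm (e.g., a round robin over the canonical vectors $e_i$ reduces the ellipsoid $\mathcal E^{TS}_t$ to a ball of radius proportional to $1/\sqrt{\lambda_{\min,t}}$), it would severely increase the second term of $R^{RLS}(T)$ and cause an overall linear regret (this happens because $x_t$ would be optimal w.r.t. a $\tilde\theta_t$ that is not in the ellipsoid). Fortunately, inspecting the definition of $R^{TS}_t$ reveals that not all components of $\theta^\star$ must be equally well estimated. In fact, we have w.h.p. that
$$R^{TS}_t \le J(\theta^\star) - \inf_{\theta \in \mathcal E^{TS}_t} J(\theta).$$
This shows that $R^{TS}_t$ is determined by the diameter of the ellipsoid $\mathcal E^{TS}_t$ w.r.t. $J$, which suggests that the estimation of $\theta^\star$ should be more accurate on the dimensions on which $J$ is more sensitive. In the case of $\mathcal X$ unit ball, the most sensitive direction of $J$ is $\theta^\star/\|\theta^\star\|$ itself and Fig. 3 illustrates two opposite cases where the accuracy in the estimation of $\theta^\star$ is the same (i.e., $V_t$ has the same eigenvalues) but the regret may be very different. Let $\Theta^{opt} = \{\theta : J(\theta) \ge J(\theta^\star)\}$ be the set of optimistic parameters. In our example $J(\theta) = \|\theta\|$ is convex, thus we can make explicit the dependency of the regret on the sensitivity of $J$ through its gradient evaluated at any $\theta \in \Theta^{opt}$ as (see Prop. 3 for the general case)
$$R^{TS}_t \le J(\theta) - \inf_{\theta' \in \mathcal E^{TS}_t} J(\theta') \le \sup_{\theta' \in \mathcal E^{TS}_t} \nabla J(\theta)^T(\theta - \theta') \le \|\nabla J(\theta)\|_{V_t^{-1}} \sup_{\theta' \in \mathcal E^{TS}_t} \|\theta - \theta'\|_{V_t},$$
which shows that the regret of a non-optimistic $\tilde\theta_t$ is bounded by the gradient of $J(\theta)$ at any optimistic $\theta$ and its distance to any other point in the TS ellipsoid.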
Step 2 (sensitivity of $J$ and optimal arm). According to Prop. 1, the second factor in the previous expression is small whenever $\theta$ belongs to the ellipsoid, while the first term cannot be immediately controlled by the algorithm. Nonetheless, we notice that since $J(\theta) = \|\theta\|$, then $\nabla J(\theta) = \theta/\|\theta\| = x^\star(\theta)$ (see Lem. 2 for the general case). This shows how selecting the optimal arm associated with an optimistic $\theta$ is equivalent to controlling the gradient of $J$, which results in
$$R^{TS}_t \le \|x^\star(\theta)\|_{V_t^{-1}} \sup_{\theta' \in \mathcal E^{TS}_t} \|\theta - \theta'\|_{V_t}.$$
From Prop. 2, we could conclude that the regret would be cumulatively small if $x^\star(\theta)$ corresponded to the arms chosen by TS ($x_t = x^\star(\tilde\theta_t)$). As a result, we need a $\theta$ that 1) is optimistic (i.e., $\theta \in \Theta^{opt}$), 2) belongs or is close to the ellipsoid $\mathcal E^{TS}_t$, and 3) is used to select an arm $x_t$. The first two requirements are at the core of the choice of the TS distribution in Def. 1, where the anti-concentration property guarantees enough probability to be optimistic, while the concentration property implies that the sampled parameters are within a small ellipsoid. Let $\tau < t$ be any step when TS selects $\tilde\theta_\tau \in \Theta^{opt}$ with corresponding arm $x_\tau = x^\star(\tilde\theta_\tau)$, then we have (see an illustration of this bound in Fig. 2 in the 1-d case) Since by Prop. 1 $\theta^\star$ is contained in all confidence ellipsoids with high probability, then Let $K$ be the number of times $\tilde\theta_t \in \Theta^{opt}$, $t_k$ the corresponding steps, and $\nu_k = t_k - t_{k-1}$, then the final regret can be written as .
Step 3 (optimism). This bound shows how important it is that TS is optimistic with high frequency. In fact, whenever $\tilde\theta_t$ is in $\Theta^{opt}$, not only is the corresponding instantaneous regret $R^{TS}_t$ upper-bounded by 0, but the exploration performed by playing arm $x^\star(\tilde\theta_t)$ also has a positive impact in controlling the regret for any subsequent non-optimistic step. Consider the extreme case when TS is never optimistic, then $K = 1$, $\nu_1 = T$ and $R^{TS}(T) = O(T)$. On the other hand, if TS is optimistic with a constant frequency, then we can easily show that $R^{TS}(T)$ is bounded by $O(\sqrt{T})$. Consider the case where an optimistic $\tilde\theta$ is chosen with probability $p$. Since $\mathbb E[\nu_k] = 1/p$, we can prove that w.h.p. $R^{TS}(T) \le O\big(\tfrac{1}{p}\sqrt{T}\big)$ by Cauchy-Schwarz and Prop. 2 applied to the sum over the optimistic steps, where $K \approx T$. Unfortunately, sampling $\tilde\theta_t$ from the RLS ellipsoid $\mathcal E^{RLS}_t$ may have a very small probability of being optimistic (see e.g., Fig. 2, where sampling uniformly in $\mathcal E^{RLS}_t$ has zero probability to return a $\tilde\theta_t \in \Theta^{opt}$). For this reason, TS is required to draw $\tilde\theta_t$ from a distribution over-sampling by a factor $\sqrt{d}$, as in the definition of $\mathcal D^{TS}$. This guarantees a fixed probability $p$ of being optimistic (see Lem. 3) and the final desired regret.

Formal Proof
In this section we report the main steps of the regret analysis, while we postpone technical lemmas to the supplementary material. We prove the following result.
As anticipated in the introduction, this bound is of order $\widetilde{O}(d^{3/2}\sqrt{T})$ and it entirely matches the result of Agrawal and Goyal [2012b]. The analysis of the regret requires extra care in the definition of the filtrations. While in analyzing $R^{RLS}$ we consider all the knowledge up to step $t$ (i.e., including the sampled parameter $\tilde\theta_t$), in $R^{TS}$ we need to study the randomness of $\tilde\theta_t$ conditional on all the information before sampling $\eta_t$. We introduce an additional filtration besides $\mathcal F^x_t$.
Definition 2. We define the filtration $\mathcal F_t$ as the accumulated information up to time $t$ before the sampling procedure, i.e., $\mathcal F_t = (\mathcal F_1, \sigma(x_1, r_2, x_2, \ldots, x_{t-1}, r_t))$.
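Notice that $\hat\theta_t$ and $V_t^{-1}$ are both $\mathcal F_t$- and $\mathcal F^x_t$-adapted, while $\tilde\theta_t$ is a random variable w.r.t. $\mathcal F_t$ and it is fixed when considering $\mathcal F^x_t$. Hence we have $\mathcal F_t \subset \mathcal F^x_t \subset \mathcal F_{t+1} \subset \cdots$. We are now ready to introduce the high-probability events we use in the rest of the proof.
Definition 3. Let $\delta \in (0, 1)$, $\delta' = \delta/(4T)$, and $t \in [1, T]$. We define $\hat E_t$ as the event where the RLS estimate concentrates around $\theta^\star$ for all steps $s \le t$, i.e., $\hat E_t = \{\forall s \le t,\ \|\hat\theta_s - \theta^\star\|_{V_s} \le \beta_s(\delta')\}$. We also define $\tilde E_t$ as the event where the sampled parameter $\tilde\theta_s$ concentrates around $\hat\theta_s$ for all steps $s \le t$, i.e., $\tilde E_t = \{\forall s \le t,\ \|\tilde\theta_s - \hat\theta_s\|_{V_s} \le \gamma_s(\delta')\}$.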

Then we have that
Conditioned on $\mathcal F_t$ and event $\hat E_t$, we have $\theta^\star \in \mathcal E^{RLS}_t$, while on event $\tilde E_t$ we have $\tilde\theta_t \in \mathcal E^{TS}_t$, then we directly bound the regret as In the interest of space we only report the formal proof to bound $R^{TS}_t$, while the bound on $R^{RLS}(T)$ and the overall regret is postponed to App. D.
Similar to the sketch in Sect. 4, the proof follows three steps: 1) we use the convexity of $J$ to upper-bound the regret by its expectation conditioned on being optimistic and to relate it to the gradient of $J$, 2) we relate the gradient of $J$ to the arms chosen by TS over time, 3) we show that despite the randomization, TS has a constant probability of being optimistic.
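Step 1 (Regret and gradient of $J(\theta)$). On event $\tilde E_t$, $\tilde\theta_t$ belongs to $\mathcal E^{TS}_t$ and thus
$$R^{TS}_t \le J(\theta^\star) - \inf_{\theta \in \mathcal E^{TS}_t} J(\theta).$$
Recalling that $\Theta^{opt}$ is the set of all optimistic parameters, we can bound the previous expression by the expectation over any random choice of $\tilde\theta$ in $\Theta^{opt}$, that is
$$R^{TS}_t \le \mathbb E\Big[J(\tilde\theta) - \inf_{\theta \in \mathcal E^{TS}_t} J(\theta) \;\Big|\; \mathcal F_t,\ \tilde\theta \in \Theta^{opt}\Big],$$
where $\tilde\theta = \hat\theta_t + \beta_t(\delta') V_t^{-1/2}\eta$ and $\eta \sim \mathcal D^{TS}$ is drawn from the TS sampling distribution. We now rely on the following characterization of $J(\theta)$ (see App. C).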
Proposition 3. For any set of arms $\mathcal X$ satisfying Asm. 1, $J(\theta) = \sup_{x \in \mathcal X} x^T\theta$ has the following properties: 1) $J$ is real-valued as the supremum is attained in $\mathcal X$, 2) $J$ is convex on $\mathbb R^d$, 3) $J$ is continuous with continuous first derivative except for a zero-measure set w.r.t. the Lebesgue measure.
These properties follow from the fact that $J$ is the support function of $\mathcal X$, and they show that $J$ is convex for any arm set $\mathcal X$. As a result, we can directly relate $R^{TS}_t$ to the gradient of $J$ as where we use Cauchy-Schwarz and we "push" the event $E_t$ into the conditioning.
Step 2 (From gradient of $J(\theta)$ to optimal arm $x^\star(\theta)$). In the sketch of the proof there was a direct relationship between $\nabla J(\theta)$ and the optimal arm corresponding to $\theta$ by direct construction. In the next lemma, we show that this connection is true for any arm set $\mathcal X$ (proof in App. C).
Lemma 2. Under Asm. 1, for any $\theta \in \mathbb R^d$, we have $\nabla J(\theta) = x^\star(\theta)$ except for a zero-measure set w.r.t. the Lebesgue measure.
This property strongly connects the exploration of TS to the actual regret. In fact, together with Prop. 2, it implies that selecting the optimal arm associated with any optimistic $\theta$ is equivalent to reducing the gradient of $J$ and ultimately the regret $R^{TS}_t$. This motivates the next step, where we show that since TS is often optimistic, the arm $x_t = x^\star(\tilde\theta_t)$ contributes to the reduction of the regret.
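The convexity of $J$ (Prop. 3) and the identity $\nabla J(\theta) = x^\star(\theta)$ (Lem. 2) can be checked numerically on a finite arm set, where $J$ is piecewise linear. The sketch below is an illustration under assumed random arms and parameters, with hypothetical helper names; the gradient is approximated by central finite differences at a point where the maximizing arm is unique.

```python
import numpy as np

def support_function(arms, theta):
    """Return J(theta) = max_x x^T theta and the maximizing arm x*(theta)."""
    values = arms @ theta
    i = int(np.argmax(values))
    return float(values[i]), arms[i]

def numerical_gradient(arms, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of J at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (support_function(arms, theta + e)[0]
                   - support_function(arms, theta - e)[0]) / (2 * eps)
    return grad

# Hypothetical arm set and parameters (assumptions for illustration).
rng = np.random.default_rng(0)
arms = rng.normal(size=(30, 3))
theta = rng.normal(size=3)
J_theta, x_opt = support_function(arms, theta)
grad = numerical_gradient(arms, theta)   # should coincide with x_opt

# Midpoint convexity along a random segment: J(midpoint) <= average of the endpoints.
theta2 = rng.normal(size=3)
J_mid = support_function(arms, 0.5 * (theta + theta2))[0]
J_avg = 0.5 * (J_theta + support_function(arms, theta2)[0])
```

The check fails only on the zero-measure set where two arms tie for the maximum, which is exactly the exception allowed in Lem. 2.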
Step 3 (Optimism). The optimism of TS is a direct consequence of the convexity of $J$ and the fact that the distribution of $\eta$ is oversampling by a factor $\sqrt{d}$ w.r.t. the ellipsoid $\mathcal E^{RLS}_t$ (proof in App. D).
Lemma 3. Let $\Theta^{opt}_t := \{\theta \in \mathbb R^d \mid J(\theta) \ge J(\theta^\star)\}$ be the set of optimistic parameters and Let $f(\tilde\theta_t)$ be an arbitrary non-negative function of $\tilde\theta_t$, then we can write the full expectation as θ Vt 1{ E t } and reintegrating the event $E_t$, we have where $1/p$ can be interpreted as the expected time between any two optimistic samples. Since $\tilde\theta$ is sampled according to the standard TS sampling distribution, it belongs to $\mathcal E^{TS}_t$ and x Finally, we can use Azuma's inequality to obtain the final bound where $x_t$ is the actual optimal arm $x^\star(\tilde\theta_t)$ selected at time $t$ by TS. The proof is concluded using Cauchy-Schwarz and Prop. 2 to bound $R^{TS}(T)$ and Prop. 1 to bound $R^{RLS}(T)$.

Discussion
In this paper we developed an alternative proof for TS in LB with novel insights on the core elements of the algorithm (optimism) and of the structure of the problem (support function J(θ)).In the following, we discuss possible applications of our results and future directions of investigation.
Regularized linear optimization. Our proof holds for any arm set $\mathcal X$ and the corresponding constrained optimization problem $\max_{x \in \mathcal X} x^T\theta^\star$. Similarly, we can apply it to any regularized linear optimization problem $\max_x f_{\mu,c}(x; \theta^\star)$, where $\mu$ is a constant and $c(x)$ is an arbitrary penalty function of $x$ (e.g., norm-regularization). While there always exists a set of constraints (corresponding to a set of arms $\mathcal X_{c,\mu,\theta^\star}$) such that the solutions to the constrained and regularized problems coincide, such a mapping is often unknown (e.g., $c(x) = \|x\|_1$) and thus TS cannot be run on $\mathcal X_{c,\mu,\theta^\star}$, but we need to directly deal with the regularized problem (i.e., sampling $\tilde\theta_t$ and pulling arm $x_t = \arg\max_x f_{\mu,c}(x; \tilde\theta_t)$). In this case, it can be seen that the three main steps of our proof still hold. In fact (see App. G), 1) $J(\theta)$ is convex, 2) the gradient of $J(\theta)$ corresponds to the optimal arm $x^\star(\theta)$, 3) Lemma 3 holds unchanged since it relies on the convexity of $J(\theta)$ and the TS distribution $\mathcal D^{TS}$ is the same. As a result, the regret bound follows. On the other hand, the original proof by Agrawal and Goyal [2012b] could be less readily applied to this case. First notice that the mapping from $\mu$ and $c(x)$ to the constrained set $\mathcal X_{c,\mu,\theta^\star}$ requires the unknown parameter $\theta^\star$. This means that if we pass from the regularized problem to the constrained problem at each time step $t$, we would be working on a set $\mathcal X_{c,\mu,\tilde\theta_t}$ which keeps changing over time. While Agrawal and Goyal [2012b] study the contextual bandit problem where $\mathcal X_t$ changes arbitrarily over time, in this case $\mathcal X_t$ would change in response to $\tilde\theta_t$ itself (i.e., it would not be available in advance) and the analysis would bound the per-step regret $r_t = \max_{x \in \mathcal X_{c,\mu,\tilde\theta_t}} x^T\theta^\star - x_t^T\theta^\star$, which does not correspond to the desired regret on $f_{\mu,c}$ (the true optimal arm $x^\star(\theta^\star)$ may not even be in $\mathcal X_{c,\mu,\tilde\theta_t}$). Alternatively, we need to formulate a suitable definition of saturated and unsaturated arms for $f_{\mu,c}(x; \theta)$, which does not seem trivial and may require developing a more ad-hoc analysis.
Other extensions. Another interesting setting to study is stochastic combinatorial optimization with semi-bandit feedback, where the arm set is the hypercube and each component of the linear combination $x^T\theta^\star$ is observed. While Wen et al. [2015] derived a frequentist regret bound for a UCB-like strategy, only a Bayesian regret analysis for TS is available. Exploiting the fact that combinatorial optimization is a special case of linear optimization, our analysis could be adapted to derive frequentist regret bounds. In Sect. F we show that we can deal with more complex scenarios and we derive the first frequentist regret bound for TS in generalized linear models (GLM). Moreover, we can generalize our proof to other convex optimization problems $\max_{x \in \mathcal X} f(x, \theta^\star)$ with linear observations (i.e., $y = x^T\theta^\star + \xi$). If $f(x, \theta)$ is convex in $\theta$, then $J(\theta)$ is convex as well, thus enabling the possibility to apply our line of proof. More precisely, the gradient of $J$ should be related to the arms played by TS (step 2, Lem. 2) and the on-policy prediction error $R^{RLS}$ measured w.r.t. $f$ should be bounded (Prop. 1). Whenever these properties are satisfied, the regret result follows. Notice that while the original proof by Agrawal and Goyal [2012b] may be extended to cover some of these problems, its requirements are slightly stronger. In fact, the definition of saturated and unsaturated arms relies on the fact that $f(x, \hat\theta_n)$ concentrates to $f(x, \theta^\star)$ for any $x$, while in our case we only need to bound $R^{RLS}$, which corresponds to an on-policy error, where prediction errors are measured on the specific arms selected by the algorithm. While this advantage may appear abstract, let us consider the reinforcement learning case, where $f(x, \theta)$ is the value function of a policy $x$ in an environment $\theta$. In this case, $f(x, \theta^\star)$ may actually be unbounded for some $x$ (i.e., the policy $x$ does not control the system) and the definition of saturated/unsaturated arms could not be easily adjusted. This suggests that our proof could enable covering special RL cases as well. Finally, we remark that defining TS as a randomized algorithm and using convex geometry arguments in its analysis bears a strong resemblance to the follow-the-perturbed-leader algorithm and its regret analysis in adversarial linear bandit [Abernethy et al., 2015], suggesting that the two approaches may be strongly related.
About optimism and oversampling. As illustrated in Sect. 4, in the current proof optimistic steps allow us to bound the regret of non-optimistic steps. Nonetheless, it can be shown that some non-optimistic steps (even very pessimistic!) may indeed be as "informative" as optimistic steps and allow reducing the regret as well. Let us consider a minor change in the line of proof, anticipating the use of the convexity of $J$, i.e.,
$$R^{TS}_t = J(\theta^\star) - J(\tilde\theta_t) \le \nabla J(\theta^\star)^T(\theta^\star - \tilde\theta_t) \le \|\nabla J(\theta^\star)\|_{V_t^{-1}} \|\theta^\star - \tilde\theta_t\|_{V_t}.$$
If we sample a $\tilde\theta$ such that the gradient $\nabla J(\tilde\theta)$ (which coincides with the corresponding optimal action $x^\star(\tilde\theta)$) has the same $V_t^{-1}$-norm as $\nabla J(\theta^\star)$, then we could apply the same reasoning as in the original sketch of the proof and bound the regret of any subsequent step. More formally, we can define the set $\Theta^{grad}_t = \{\theta : \|\nabla J(\theta)\|_{V_t^{-1}} \ge \|\nabla J(\theta^\star)\|_{V_t^{-1}}\}$ of parameters that have a larger gradient than $\theta^\star$. Similar to $\Theta^{opt}$, if the probability of sampling $\tilde\theta$ in $\Theta^{grad}_t$ is lower-bounded by a constant $p'$, then the proof can be reproduced with exactly the same arguments and result. Even further, we could relax the requirement and define $\Theta^{grad}_t(\alpha) = \{\theta : \|\nabla J(\theta)\|_{V_t^{-1}} \ge \alpha \|\nabla J(\theta^\star)\|_{V_t^{-1}}\}$, with $\alpha < 1$, which would allow an even bigger probability at the cost of an extra constant factor $1/\alpha$ in the final regret. As illustrated in Fig. 4, in the case $\mathcal X = \mathbb R^d$, $\Theta^{grad}_t(\alpha)$ corresponds to a cone whose overlap with $\mathcal E^{TS}$ may actually be even larger than for $\Theta^{opt}$. This illustration shows that the set of useful explorative actions does not necessarily coincide with the set of optimistic parameters and that many more parameters in $\mathcal E^{TS}$ may contribute to reducing the regret. This may explain the empirical success of TS and it may suggest that the oversampling by a factor $\sqrt{d}$ to ensure optimism may be too strong a requirement. Finally, we remark that a similar optimistic argument is employed by Agrawal and Goyal [2013] in MAB. Nonetheless, in Lemma 2 they prove that the probability of being optimistic increases over time. This may suggest that $\mathcal E^{TS}$ needs to be only a constant fraction bigger than $\mathcal E^{RLS}$, since the initial small probability of being optimistic would tend to a constant (or even to 1) later on during the learning process. Whether this argument holds and how to prove it remains an open question.

A Examples of TS distributions
) is a hyper-spherical cap for any direction $u$ of $\mathbb R^d$, the anti-concentration property is satisfied provided that the ratio between the volume of a hyper-spherical cap of height $\sqrt{d} - 1$ and the volume of the ball of radius $\sqrt{d}$ is constant (i.e., independent of $d$). Using standard geometric results (see Prop. 9), one has that for any vector $\|u\| = 1$ where $I_x(a, b)$ is the regularized incomplete beta function. In Prop. 10 we prove that and hence we obtain $p = \frac{1}{16\sqrt{6\pi}}$.
Example 2: Gaussian case $\eta \sim \mathcal N(0, I_d)$. The concentration property comes directly from the Chernoff bound for standard Gaussian random variables together with a union bound argument. For any $\alpha > 0$, we have Standard concentration inequality for Gaussian random variables gives, $\forall \alpha > 0$, Plugging everything together with $\alpha = \sqrt{2 \log \frac{2d}{\delta}}$ gives the desired result with $c = c' = 2$. Let $\eta_i$ be the $i$-th component of $\eta$ for any $1 \le i \le d$. Then $\eta_i \sim \mathcal N(0, 1)$. Since $\eta$ is rotationally invariant, for any direction $u$ of $\mathbb R^d$ and an appropriate choice of basis, we have $\mathbb P(u^T\eta \ge 1) \ge \mathbb P(\eta_1 \ge 1)$. From standard Gaussian properties (see Thm 2 of Chang et al. [2011]) we have which ensures the anti-concentration property with $p = \frac{1}{4\sqrt{e\pi}}$.
Then define y = x + (x − x) for some x ∈ bound(C). By definition of the open set int(C), ∃ > 0 such that y ∈ int(C). Moreover, x ∈ [y, x] e.g.

B Properties of convex function
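Using the convexity of $f$, and since $x^\star$ lies strictly between $y$ and $x$ on the segment $[y, x]$, one has for some $\lambda \in (0, 1)$

$$f(x^\star) \le \lambda f(y) + (1 - \lambda) f(x) < f(x^\star),$$

where the strict inequality uses $f(y) \le f(x^\star)$ and $f(x) < f(x^\star)$, both of which hold by assumption. This is impossible.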
Proposition 5. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function. Let $B_d(0, 1)$ be the unit $d$-dimensional ball and $S_d(0, 1)$ the associated unit sphere. Given a point $x \in S_d(0, 1)$, define $H(x)$ as the hyperplane tangent to $B_d(0, 1)$ at the point $x$. $H(x)$ splits $\mathbb{R}^d$ into two complementary half-spaces $G(x)$ and $G^\perp(x)$, where $G(x)$ does not contain the unit ball by convention. Then for any $x^\star \in S_d(0, 1)$ such that $f(x^\star) \ge f(x)$ for all $x \in B_d(0, 1)$, one has

$$f(y) \ge f(x^\star), \quad \forall y \in G(x^\star).$$

Proof. We first notice that, from Proposition 4, $x^\star$ is well defined since the maximum is reached on the boundary. The associated half-space $G(x^\star)$ is then

$$G(x^\star) = \{ y = x^\star + u \ : \ u^T x^\star \ge 0 \}.$$

We want to show that $f(y) \ge f(x^\star)$ for any $y \in G(x^\star)$. We introduce the increasing sequence of subsets $G_n$, $n \ge 2$.
For any $y = x^\star + u$ in $G_n$, we associate

By definition of $y$ (and hence $u$), we have

Since the statement of the proposition holds for any $G_n$, we obtain the desired result for $G(x^\star)$ by continuity of $f$. Let $y \in G(x^\star)$, $y = x^\star + u$. If $u^T x^\star > 0$, then $\exists n \ge 2$ such that $y \in G_n$ and the proposition is satisfied. Otherwise, if $u^T x^\star = 0$, we introduce the sequences $\{u_n\}$ and $\{y_n\}$ defined as:

By construction, $y_n \in G_n$ and $y_n \to y$ as $n \to \infty$. Since $f(y_n) \ge f(x^\star)$ for any $n \ge 2$, we obtain the desired result by taking the limit, since $f$ is continuous as a convex function on $\mathbb{R}^d$.
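The statement of Proposition 5 can be checked numerically on a simple family of convex functions. The sketch below assumes $f$ is a pointwise maximum of affine maps (an illustrative choice, not the function $J$ of the paper), for which the maximizer over the unit ball is known in closed form, and verifies that $f(y) \ge f(x^\star)$ for random points $y = x^\star + u$ with $u^T x^\star \ge 0$. All names and sizes are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def f(y, slopes, offsets):
    """Convex test function: pointwise maximum of the affine maps a_i^T y + b_i."""
    return max(dot(a, y) + b for a, b in zip(slopes, offsets))

def maximizer_on_unit_ball(slopes, offsets):
    """Maximizer of f over the unit ball.

    For a maximum of affine maps, the maximum over the ball is max_i (||a_i|| + b_i),
    attained on the sphere at a_i / ||a_i|| for the best index i.
    """
    best = max(range(len(slopes)), key=lambda i: norm(slopes[i]) + offsets[i])
    scale = norm(slopes[best])
    return [v / scale for v in slopes[best]]

rng = random.Random(0)
d, n_pieces = 4, 6                                   # arbitrary illustrative sizes
slopes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_pieces)]
offsets = [rng.gauss(0.0, 1.0) for _ in range(n_pieces)]

x_star = maximizer_on_unit_ball(slopes, offsets)
f_star = f(x_star, slopes, offsets)

violations = 0
for _ in range(5000):
    u = [rng.gauss(0.0, 3.0) for _ in range(d)]
    if dot(u, x_star) < 0.0:                         # flip u so that u^T x_star >= 0
        u = [-v for v in u]
    y = [xs + ui for xs, ui in zip(x_star, u)]       # y lies in the half-space G(x_star)
    if f(y, slopes, offsets) < f_star - 1e-9:        # small tolerance for rounding
        violations += 1

print("f(x_star) =", round(f_star, 4), " violations:", violations)
```

No violation is expected, since any affine piece active at $x^\star$ can only increase when moving into $G(x^\star)$.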
Theorem 2 (A.D. Alexandrov). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function, then it is twice differentiable almost everywhere with respect to the Lebesgue measure.
Proof. This result is an extension of Rademacher's theorem for convex functions. A proof can be found in Niculescu and Persson [2006], Theorem 3.11.2.

C Properties of support function (proof of Proposition 3 and Lemma 2)
We study the support function of a set $C$, which is the function $f_C : \mathbb{R}^d \to \mathbb{R}$ defined as $f_C(\theta) = \sup_{x \in C} x^T \theta$. Such functions are at the core of convex geometry analysis.
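Proposition 6. Let $C \subset \mathbb{R}^d$ be a non-empty compact set and $f_C$ the associated support function. Then,
1. $f_C$ is real-valued and the supremum is attained in $C$,
2. $f_C$ is convex,
3. $f_C$ is continuous on $\mathbb{R}^d$ and twice differentiable almost everywhere with respect to the Lebesgue measure.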
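Proof. 1. This comes directly from the compactness of $C$: since $C$ is bounded, the support function is real-valued, and since $C$ is closed, the supremum is attained in $C$.
2. Let $\theta_1, \theta_2$ be two vectors of $\mathbb{R}^d$, and $t \in (0, 1)$. By definition of the supremum, since $f_C$ is real-valued:

$$f_C(t\theta_1 + (1-t)\theta_2) = \sup_{x \in C} x^T \left(t\theta_1 + (1-t)\theta_2\right) \le t \sup_{x \in C} x^T \theta_1 + (1-t) \sup_{x \in C} x^T \theta_2 = t f_C(\theta_1) + (1-t) f_C(\theta_2).$$

Proof (of Proposition 7). Thanks to Proposition 6, we know that the supremum is attained at some $x(\theta) \in C$. Moreover, Alexandrov's theorem guarantees that $N$ is a null set. Since the sub-gradient reduces to a singleton wherever the function is differentiable, i.e., $\partial f_C(\theta) = \{\nabla f_C(\theta)\}$ for all $\theta \in \mathbb{R}^d \setminus N$, one just needs to show that $x(\theta) \in \partial f_C(\theta)$ for all $\theta \in \mathbb{R}^d$. Since $f_C(\theta) = \max_{x \in C} x^T \theta$, there exists at least one $x(\theta) \in C$ for which the maximum is attained, i.e., $x(\theta)^T \theta = f_C(\theta)$. Then, for any $\theta' \in \mathbb{R}^d$,

$$f_C(\theta') \ge x(\theta)^T \theta' = f_C(\theta) + x(\theta)^T (\theta' - \theta),$$

which is the definition of the sub-gradient.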
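To make the sub-gradient property concrete, the following sketch evaluates the support function of a small finite set $C$ (a hypothetical stand-in for a compact arm set) and checks the inequality $f_C(\theta') \ge f_C(\theta) + x(\theta)^T(\theta' - \theta)$ on random pairs. It also compares $x(\theta)$ with a finite-difference gradient at one arbitrary point. The set, the test point and the helper names are assumptions made for illustration.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def support(theta, points):
    """Return (f_C(theta), x(theta)) for a finite set C given as a list of points."""
    x_best = max(points, key=lambda x: dot(x, theta))
    return dot(x_best, theta), x_best

rng = random.Random(1)
d = 3
# Hypothetical finite set C of 8 points in [-1, 1]^3.
C = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(8)]

# Sub-gradient inequality on random pairs (theta, theta').
violations = 0
for _ in range(2000):
    theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
    theta_prime = [rng.gauss(0.0, 1.0) for _ in range(d)]
    f_theta, x_theta = support(theta, C)
    f_prime, _ = support(theta_prime, C)
    diff = [p - q for p, q in zip(theta_prime, theta)]
    if f_prime < f_theta + dot(x_theta, diff) - 1e-9:   # tolerance for rounding
        violations += 1

# Central finite differences at one arbitrary point, to compare with x(theta).
theta0 = [0.3, -1.2, 0.8]
eps = 1e-6
_, x_theta0 = support(theta0, C)
grad = []
for j in range(d):
    plus, minus = theta0[:], theta0[:]
    plus[j] += eps
    minus[j] -= eps
    grad.append((support(plus, C)[0] - support(minus, C)[0]) / (2 * eps))

print("sub-gradient violations:", violations)
print("x(theta0)            :", [round(v, 4) for v in x_theta0])
print("finite-diff gradient :", [round(v, 4) for v in grad])
```

For a finite set the support function is piecewise linear, so the two printed vectors should agree unless the test point happens to lie on a kink, which is the null set $N$ of Proposition 7.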

D Regret Proofs
We collect here the main tools needed to derive the proof. We first recall Azuma's concentration inequality for super-martingales.
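Proposition 8. If a super-martingale $(Y_t)_{t \ge 0}$ corresponding to a filtration $F_t$ satisfies $|Y_t - Y_{t-1}| < c_t$ for some constant $c_t$ for all $t = 1, \ldots, T$, then for any $\alpha > 0$,

$$P(Y_T - Y_0 \ge \alpha) \le \exp\left(-\frac{\alpha^2}{2 \sum_{t=1}^{T} c_t^2}\right).$$

Proof of Lemma 1. We first bound the two events separately.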
Bounding $\hat{E}$. This bound is a straightforward application of Proposition 1 together with a union bound argument. Let $\delta' = \delta/(4T)$, then

Bounding $\tilde{E}$. This bound comes directly from the concentration property of the TS sampling distribution.
From the expression $\tilde\theta_t = \hat\theta_t + \beta_t(\delta') V_t^{-1/2} \eta_t$, where $\eta_t$ is drawn i.i.d. from $D^{TS}$, we have

Then from Definition 1, we have

As before, a union bound over the two bounds ensures that

Finally, a union bound argument between the two terms leads to

Proof of Lemma 3. We need to study the probability that a $\tilde\theta$ drawn at time $t$ from the TS sampling distribution is optimistic, i.e., $J(\tilde\theta) \ge J(\theta^\star)$, under event $E_t$. More formally let

Using the definition of $E_t$, we have that $\theta^\star \in E^{RLS}_t$ (i.e., the true parameter vector belongs to the RLS ellipsoid), and then we can replace $J(\theta^\star)$ by the supremum over the ellipsoid as

By recalling the definition of the TS sampling process, we can write $\tilde\theta = \hat\theta_t + \beta_t(\delta') V_t^{-1/2} \eta$, where $\eta \sim D^{TS}$, and for notational convenience we define the function $f_t(\eta) = J(\hat\theta_t + \beta_t(\delta') V_t^{-1/2} \eta)$. Since the supremum is taken within $E^{RLS}_t$, $\eta_t$ belongs to the unit ball (i.e., $\eta_t \in B_d(0, 1)$). As a result, we can rewrite the previous expression as

Since the function $f_t$ inherits all the properties of $J$, notably its convexity in $\eta$, we know that the supremum on a convex closed set is reached at least at one point, which we denote $\bar\eta_t$, and that it belongs to the boundary (see Prop. 4), which in our case corresponds to $\|\bar\eta_t\| = 1$. Moreover, let $H_t(\bar\eta_t)$ be the hyperplane tangent to the unit ball at $\bar\eta_t$. $H_t(\bar\eta_t)$ splits $\mathbb{R}^d$ into two complementary half-spaces $G_t$ and $G_t^\perp$, where $G_t$ does not contain the unit ball by convention. Again, the convexity of $f_t$ ensures that $f_t(\eta) \ge f_t(\bar\eta_t)$ for all $\eta \in G_t$, as proved in Prop. 5. As illustrated in Fig. 5, the probability of being optimistic is now reduced to the probability that $\eta$ drawn from $D^{TS}$ falls into $G_t$, which corresponds to

Let $u_t$ be the vector defining the hyperplane $H_t(\bar\eta_t)$; notice that $u_t$ is entirely determined by the filtration $F_t$ and the event $E_t$, and it is thus independent of $\eta_t$. As a result, we finally obtain

where the last step immediately follows from property 1 of Def. 1 of the TS sampling distribution.
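The optimism argument of Lemma 3 can be illustrated with a small simulation. The sketch below assumes a finite arm set, a diagonal design matrix $V_t$ and Gaussian $\eta$ (all hypothetical choices made to keep the code dependency-free), samples $\tilde\theta = \hat\theta_t + \beta V_t^{-1/2}\eta$, and measures how often $J(\tilde\theta)$ exceeds the supremum of $J$ over the RLS ellipsoid. The result is compared with the Gaussian anti-concentration constant $p = \frac{1}{4\sqrt{e\pi}}$ of Example 2.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ts_optimism_frequency(arms, theta_hat, v_diag, beta, n_samples=20000, seed=0):
    """Estimate P(J(theta_tilde) >= sup of J over the RLS ellipsoid).

    J(theta) = max_x x^T theta over the finite arm set; V is assumed diagonal,
    so V^{-1/2} x is obtained by dividing each coordinate by sqrt(v_j).
    """
    rng = random.Random(seed)
    d = len(theta_hat)
    scaled = [[xj / math.sqrt(vj) for xj, vj in zip(x, v_diag)] for x in arms]
    means = [dot(x, theta_hat) for x in arms]
    # Supremum of J over {theta_hat + beta V^{-1/2} eta : ||eta|| <= 1}.
    j_sup = max(m + beta * math.sqrt(dot(s, s)) for m, s in zip(means, scaled))
    hits = 0
    for _ in range(n_samples):
        eta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        j_tilde = max(m + beta * dot(s, eta) for m, s in zip(means, scaled))
        if j_tilde >= j_sup:
            hits += 1
    return hits / n_samples

# Hypothetical problem instance, chosen only for illustration.
arms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8], [-0.5, 0.5, 0.7]]
theta_hat = [0.2, -0.1, 0.4]
v_diag = [2.0, 5.0, 1.0]
beta = 1.5

P_GAUSS = 1.0 / (4.0 * math.sqrt(math.e * math.pi))   # constant p of Example 2
freq = ts_optimism_frequency(arms, theta_hat, v_diag, beta)
print(f"estimated probability of optimism = {freq:.3f}  (p = {P_GAUSS:.4f})")
```

In this setting the optimistic region contains the half-space $\{\eta : u_t^T \eta \ge 1\}$, so the estimated frequency should not fall below $p$.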
Proof of Theorem 1. We first bound the two regret terms $R^{TS}(T)$ and $R^{RLS}(T)$.

Bound on $R^{TS}(T)$. We collect the bounds on each term $R^{TS}_t$ and obtain

Since this term contains an expectation, we cannot directly apply Proposition 2, and we first need to rewrite the total regret $R^{TS}(T)$ as

From Prop. 2, the first term is bounded as

We now proceed by applying Azuma's inequality (Prop. 8) to the second term, which is a martingale by construction. Under Assumption 1, $\|x_t\| \le 1$ for all $t \ge 1$, so since

This provides an upper bound on each element of $R^{TS}_2$, which holds with probability at least $1 - \frac{\delta}{2}$ as

Bound on $R^{RLS}(T)$. The bound on $R^{RLS}$ is derived as in previous results [Abbasi-Yadkori et al., 2011b, Agrawal and Goyal, 2012b]. We decompose the term into a sampling prediction error and an RLS prediction error as follows

Final bound. We finally plug everything together, since from Lemma 1 the concentration event holds with probability at least $1 - \frac{\delta}{2}$. Using the bound on $R^{TS}(T)$ and a union bound argument, one obtains the desired result, which holds with probability at least $1 - \delta$.

E Hyperspherical cap and beta function
where $I_x(a, b)$ is the regularized incomplete beta function.
Proof. The proof can be found in Li [2011].
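The cap-volume formula can be sanity-checked numerically. The sketch below assumes the form reported by Li [2011], namely that the cap of height $h = R - r$ occupies a fraction $\frac{1}{2} I_{1-(r/R)^2}\left(\frac{d+1}{2}, \frac{1}{2}\right)$ of the ball of radius $R$, evaluates the regularized incomplete beta function with a plain Simpson rule (an implementation choice made here to avoid external dependencies), and compares the result with the elementary closed forms available for $d = 2$ and $d = 3$.

```python
import math

def incomplete_beta(x, a, b, n=20000):
    """Simpson approximation of B_x(a, b) = int_0^x t^(a-1) (1-t)^(b-1) dt.

    Assumes 0 <= x < 1 and a >= 1, so the integrand is finite on [0, x]; n must be even.
    """
    h = x / n
    def g(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)
    total = g(0.0) + g(x)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * g(k * h)
    return total * h / 3.0

def regularized_incomplete_beta(x, a, b):
    """I_x(a, b) = B_x(a, b) / B(a, b), with B(a, b) computed from gamma functions."""
    beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return incomplete_beta(x, a, b) / beta

def cap_fraction(d, R, r):
    """Fraction of the d-ball of radius R lying beyond a hyperplane at distance r < R."""
    x = 1.0 - (r / R) ** 2
    return 0.5 * regularized_incomplete_beta(x, (d + 1) / 2.0, 0.5)

# d = 2: circular segment with half-angle phi, where cos(phi) = r / R.
R2, r2 = math.sqrt(2.0), 1.0
phi = math.acos(r2 / R2)
exact_2d = (phi - math.sin(phi) * math.cos(phi)) / math.pi

# d = 3: spherical cap volume pi h^2 (3R - h) / 3 divided by 4 pi R^3 / 3.
R3, r3 = math.sqrt(3.0), 1.0
h3 = R3 - r3
exact_3d = h3 ** 2 * (3.0 * R3 - h3) / (4.0 * R3 ** 3)

print("d=2:", cap_fraction(2, R2, r2), "closed form:", exact_2d)
print("d=3:", cap_fraction(3, R3, r3), "closed form:", exact_3d)
```

With $R = \sqrt{d}$ and $r = 1$ this fraction is exactly the probability $P(u^T \eta \ge 1)$ used for the uniform distribution in Example 1.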
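Proof. The regularized incomplete beta function can be expressed in terms of the beta function $B(a, b)$ and the incomplete beta function $B_x(a, b)$ as

$$I_x(a, b) = \frac{B_x(a, b)}{B(a, b)}, \quad \text{where} \quad B_x(a, b) = \int_0^x t^{a-1} (1-t)^{b-1} \, dt.$$

Hence we seek a lower bound on $B_{1 - \frac{1}{d}}\left(\frac{d+1}{2}, \frac{1}{2}\right)$ and an upper bound on $B\left(\frac{d+1}{2}, \frac{1}{2}\right)$.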
1. Let us first find a lower bound for the incomplete beta function. Since t → t

From the increasing property of $x \mapsto (1 - \frac{\alpha}{x})^x$ for any α < 1 the sequence (1 − 3 2d ) is increasing and

2. Now we seek an upper bound for $B\left(\frac{d+1}{2}, \frac{1}{2}\right)$. Since $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$, one has:

From Chen and Qi [2005] we have the following inequalities for the gamma function, $\forall n \ge 1$:

$$\left(n + \frac{4}{\pi} - 1\right)^{-1/2} \le \frac{\Gamma(n + 1/2)}{\Gamma(n + 1)} \le \left(n + \frac{1}{4}\right)^{-1/2}.$$

Together with $\Gamma(x + 1) = x\Gamma(x)$, and treating separately the cases where $d$ is even or odd, one gets, $\forall d \ge 2$,

Using the obtained upper and lower bounds we get:

F Generalized Linear Bandit
We present here how to apply our derivation to the generalized linear bandit (GLM) problem of Filippi et al. [2010]. The regret bound is obtained by showing that the GLM problem can be reduced to the linear case.
The setting. Let $X \subset \mathbb{R}^d$ be an arbitrary (finite or infinite) set of arms. Every time an arm $x \in X$ is pulled, a reward is generated as $r(x) = \mu(x^T \theta^\star) + \xi$, where $\mu$ is the so-called link function, $\theta^\star \in \mathbb{R}^d$ is a fixed but unknown parameter vector and $\xi$ is a random zero-mean noise. The value of an arm $x \in X$ is evaluated according to its expected reward $\mu(x^T \theta^\star)$, and for any parameter $\theta \in \mathbb{R}^d$ we denote the optimal arm and its optimal value as

$$x^\star(\theta) = \arg\max_{x \in X} \mu(x^T \theta), \quad J^{GLM}(\theta) = \sup_{x \in X} \mu(x^T \theta).$$

Then $x^\star = x^\star(\theta^\star)$ is the optimal arm associated with the true parameter $\theta^\star$ and $J^{GLM}(\theta^\star)$ its optimal value. At each step $t$, a learner chooses an arm $x_t \in X$ using all the information observed so far (i.e., the sequence of arms and rewards) but without knowing $\theta^\star$ and $x^\star$. At step $t$, the learner suffers an instantaneous regret corresponding to the difference between the expected rewards of the optimal arm $x^\star$ and the arm $x_t$ played at time $t$. The objective of the learner is to minimize the cumulative regret up to a finite step $T$, $\sum_{t=1}^{T} \left( \mu(x^{\star T} \theta^\star) - \mu(x_t^T \theta^\star) \right)$.

Assumptions. The assumptions associated with this more general problem are the same as in the linear bandit problem, plus one regarding the link function. Formally, we require Assumptions 1, 2 and 3 and add:

Assumption 4 (link function). The link function $\mu : \mathbb{R} \to \mathbb{R}$ is continuously differentiable, Lipschitz with constant $k_\mu$ and such that $c_\mu = \inf_{\theta \in \mathbb{R}^d, x \in X} \dot\mu(x^T \theta) > 0$.
Technical tools. Let $(x_1, \ldots, x_t) \in X^t$ be a sequence of arms and $(r_2, \ldots, r_{t+1})$ be the corresponding observed (random) rewards, then the unknown parameter $\theta^\star$ can be estimated by the GLM estimator. Following Filippi et al. [2010] one gets, for any regularization parameter $\lambda \in \mathbb{R}_+$,

where $V_t$ is the same design matrix as in the linear case. Similar to Prop. 1, we have a concentration inequality for the GLM estimate.
Asm. 4 on the link function, together with the properties of the GLM estimator, implies the following:
1. since the first derivative is strictly positive, $\mu$ is strictly increasing and $x^\star(\theta) = \arg\max_{x \in X} x^T \theta$, so we retrieve the optimal arm of the linear case (and the support function),
2. the concentration inequality of the GLM estimate involves the same ellipsoid as for the RLS (multiplied by a factor $\frac{1}{c_\mu}$).
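The first fact can be illustrated on a toy instance. The sketch below uses a hypothetical link function $\mu(z) = z + \tanh(z)$, chosen only because its derivative lies in $(1, 2]$ so that Asm. 4 holds with $c_\mu = 1$ and $k_\mu = 2$; the arm set and parameter are likewise made up. It checks that the arm maximizing $\mu(x^T\theta)$ is the arm maximizing $x^T\theta$, and computes the instantaneous GLM regret of every arm.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mu(z):
    """Hypothetical link function: strictly increasing, derivative 1 + sech(z)^2 in (1, 2]."""
    return z + math.tanh(z)

# Hypothetical arm set and true parameter, for illustration only.
arms = [[1.0, 0.0], [0.6, 0.8], [-0.5, 0.5], [0.0, -1.0]]
theta_star = [0.7, 0.4]

linear_values = [dot(x, theta_star) for x in arms]
glm_values = [mu(v) for v in linear_values]

# Optimal arm under the linear value and under the GLM value.
linear_best = max(range(len(arms)), key=lambda i: linear_values[i])
glm_best = max(range(len(arms)), key=lambda i: glm_values[i])

# Instantaneous regret of each arm: mu(x_star^T theta_star) - mu(x^T theta_star).
regrets = [glm_values[glm_best] - v for v in glm_values]

print("best arm (linear):", linear_best, " best arm (GLM):", glm_best)
print("instantaneous regrets:", [round(r, 4) for r in regrets])
```

Because $\mu$ is strictly increasing, both criteria rank the arms identically, which is why the arm selection rule of the linear case can be reused unchanged.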
These two facts suggest using exactly the same TS algorithm as for the linear case (with $\beta$ multiplied by a factor $\frac{1}{c_\mu}$).

Sketch of the proof. From the previous comments, making use of the properties of $\mu$, one just needs to reduce

Preliminary work. Under review by AISTATS 2017. Do not distribute.

Figure 4: Illustration of the non-optimistic region that could contribute to reduce the regret.

Example 1: Uniform distribution $\eta \sim U_{B_d(0, \sqrt{d})}$. The uniform distribution satisfies the concentration property with constants $c = 1$ and $c' = \frac{e}{d}$ by definition. Since the set {η|u

Proposition 4. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function and $C$ be a closed convex set of $\mathbb{R}^d$. Then, on $C$, $f$ reaches its maximum on the boundary of $C$.
Proof. Let us denote by $\mathrm{int}(C)$ and $\mathrm{bound}(C)$ the interior and the boundary of the closed convex set $C$, respectively. Assume that $\exists x^\star \in \mathrm{int}(C)$ such that $f(x^\star) > f(x)$ for any $x \in \mathrm{bound}(C)$ and $f(x^\star) \ge f(y)$ for any $y \in \mathrm{int}(C)$.

3. The continuity is a consequence of the convexity of $f_C$ on the open convex set $\mathbb{R}^d$, and the second-order differentiability comes from Alexandrov's theorem (Theorem 2).

Proposition 7. Let $x(\theta) \in \arg\sup_{x \in C} x^T \theta$, and denote by $\nabla f_C(\theta)$ and $\partial f_C(\theta)$ the gradient (when it is uniquely defined) and the sub-gradient of $f_C$ at $\theta \in \mathbb{R}^d$. Then,
1. for all $\theta \in \mathbb{R}^d$, $x(\theta) \in \partial f_C(\theta)$,
2. there exists a null set $N$ with respect to the Lebesgue measure such that $x(\theta) = \nabla f_C(\theta)$ for all $\theta \in \mathbb{R}^d \setminus N$,
3. equivalently, $x(\theta) = \nabla f_C(\theta)$, where the equality holds in the sense of distributions.

Figure 5: Illustration of the probability of selecting an optimistic $\tilde\theta_t$.

Proposition 9. Let $V_d(R)$ be the volume of the $d$-dimensional ball of radius $R$ and let $V^{cap}_d(h)$ be the volume of the hyperspherical cap of height $h = R - r > 0$. Then,

$$V^{cap}_d(h) = \frac{1}{2} V_d(R) \, I_{1 - (r/R)^2}\left(\frac{d+1}{2}, \frac{1}{2}\right).$$

Figure: Illustration of the steps 2) and 3) of the proof in $\mathbb{R}^1$ and $\mathbb{R}^2$. Left: The regret at step $t$ could be bounded by the gradient of the function $J$ at a previous optimistic $\tilde\theta_\tau$ times the distance between $\tilde\theta_\tau$ and the current $\tilde\theta_t$.
1 and E t 2,t have an equivalent accurate estimation of $\theta^\star$, E t 1,t has smaller regret than E t 2,t.