On Gap-Based Lower Bounding Techniques for Best-Arm Identification

In this paper, we consider techniques for establishing lower bounds on the number of arm pulls for best-arm identification in the multi-armed bandit problem. While a recent divergence-based approach was shown to provide improvements over an older gap-based approach, we show that the latter can be refined to match the former (up to constant factors) in many cases of interest under Bernoulli rewards, including the case where the arm means are bounded away from zero and one. Together with existing upper bounds, this indicates that the divergence-based and gap-based approaches are both effective for establishing sample complexity lower bounds for best-arm identification.


Introduction
The multi-armed bandit (MAB) problem [1] provides a versatile framework for sequentially searching for high-reward actions, with applications including clinical trials [2], online advertising [3], adaptive routing [4], and portfolio design [5]. The best-arm identification problem seeks to find the arm with the highest mean using as few arm pulls as possible, and dates back to the works of Bechhofer [6] and Paulson [7]. More recently, several algorithms have been proposed for best-arm identification, including successive elimination [8], lower-upper confidence bound algorithms [9,10], PRISM [11], and gap-based elimination [12]. The latter establishes a sample complexity that is known to be optimal in the two-arm case [13], and more generally near-optimal.
Complementary to these upper bounds are information-theoretic lower bounds on the performance of any algorithm. Such bounds serve as a means to assess the degree of optimality of practical algorithms and to identify where further improvements are possible, thus focusing research on directions that can have the greatest practical impact. Lower bounds were given by Mannor and Tsitsiklis [14] for Bernoulli bandits, and by Kaufmann et al. [15] for more general reward distributions. Both of these works were based on the difficulty of distinguishing bandit instances that differ in only a single arm distribution, but the subsequent analysis techniques differed significantly, with [14] using a direct change-of-measure analysis and introducing gap-based quantities equaling the difference between two arm means, and [15] using a form of the data processing inequality for KL divergence. We refer to these as the gap-based and divergence-based approaches, respectively. Further works on best-arm identification lower bounds include [16,17,18].
The divergence-based approach was shown in [15] to attain a stronger result than that of [14] with a simpler proof, as we outline in Section 2.2. In this paper, we address the question of whether the gap-based approach is fundamentally limited, or whether it can be refined to attain similar results to [15]. We show that the answer is the latter in many cases of interest, by suitably refining the analysis of [14]. The existing results and our results are presented in Section 2, and our analysis is presented in Section 3.

Problem Setup
We consider the following setup:
• There are $M$ arms with Bernoulli rewards; the means are $p = (p_1, p_2, \cdots, p_M)$, and this set of means is said to define the bandit instance. Our analysis will consider instances with arms sorted such that $p_1 \ge p_2 \ge \cdots \ge p_M$, without loss of generality.
• The agent would like to find an arm whose mean is within ε of the highest arm mean for some $0 < \varepsilon < 1$, i.e., an arm $l$ with $p_l > p_1 - \varepsilon$. Even if there are multiple such arms, identifying any one of them suffices.
• In each round, the agent can pull any arm $l \in [M]$ and observe a reward $X_l^{(s)} \sim \mathrm{Bernoulli}(p_l)$, where $s$ is the number of times the $l$-th arm has been pulled so far. We assume that the rewards are independent, both across arms and across time steps.
• In each round, the agent can alternatively choose to terminate and output an arm index $\hat{l}$ believed to be ε-optimal. The round at which this occurs is denoted by $T$, and is a random variable because it is allowed to depend on the rewards observed. We are interested in the expected number of arm pulls (also called the sample complexity) $\mathbb{E}_p[T]$ for a given instance $p$, which should ideally be as low as possible.
• An algorithm is said to be (ε, δ)-PAC (Probably Approximately Correct) if, for all bandit instances, it outputs an ε-optimal arm with probability at least $1 - \delta$ when it terminates at the stopping time $T$.
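To make the interaction protocol concrete, the following Python sketch implements a Bernoulli bandit instance along with the simple strategy of pulling every arm $n = \lceil (2/\varepsilon^2) \log(2M/\delta) \rceil$ times and returning the empirical best; by Hoeffding's inequality and a union bound, this strategy is (ε, δ)-PAC, and its sample complexity $Mn = O\big(\frac{M}{\varepsilon^2} \log \frac{M}{\delta}\big)$ is the type of quantity that the lower bounds below are compared against. (This is an illustrative baseline only; the class and function names are our own.)

```python
import math
import random

class BernoulliBandit:
    """A bandit instance defined by its vector of arm means p."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)
        self.pulls = 0  # total number of arm pulls (the quantity T at termination)

    def pull(self, arm):
        """Pull the given arm and observe an independent Bernoulli reward."""
        self.pulls += 1
        return 1 if self.rng.random() < self.means[arm] else 0

def naive_pac(bandit, num_arms, eps, delta):
    """Pull each arm n times and return the empirical best arm. By Hoeffding's
    inequality and a union bound, every empirical mean is within eps/2 of the
    truth with probability >= 1 - delta, making the output eps-optimal."""
    n = math.ceil((2 / eps**2) * math.log(2 * num_arms / delta))
    estimates = [sum(bandit.pull(l) for _ in range(n)) / n for l in range(num_arms)]
    return max(range(num_arms), key=lambda l: estimates[l])

bandit = BernoulliBandit([0.7, 0.6, 0.5, 0.4])
print(naive_pac(bandit, 4, eps=0.1, delta=0.05), bandit.pulls)
```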
We will frequently make use of some fundamental quantities. First, the best arm mean and the gap to the best arm are denoted by
$$p^* = \max_{l \in [M]} p_l, \quad (1)$$
$$\Delta_l = p^* - p_l. \quad (2)$$
The set of ε-optimal arms and the set of ε-suboptimal arms are respectively given by
$$M(p, \varepsilon) = \{ l \in [M] : p_l > p^* - \varepsilon \}, \quad (3)$$
$$N(p, \varepsilon) = \{ l \in [M] : p_l \le p^* - \varepsilon \}, \quad (4)$$
and we make use of the binary KL divergence function
$$d(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}, \quad (5)$$
where here and subsequently, log(·) denotes the natural logarithm.
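These quantities are straightforward to compute; the following sketch (an illustration of the definitions (1)-(5), with function names of our own choosing) evaluates them for a small example instance.

```python
import math

def binary_kl(p, q):
    """Binary KL divergence d(p, q) with natural logarithms, using the usual
    convention 0 log 0 = 0; assumes 0 < q < 1."""
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def instance_quantities(p, eps):
    """Best mean p*, gaps Delta_l, eps-optimal set M, eps-suboptimal set N."""
    p_star = max(p)
    gaps = [p_star - pl for pl in p]
    M = [l for l in range(len(p)) if p[l] > p_star - eps]
    N = [l for l in range(len(p)) if p[l] <= p_star - eps]
    return p_star, gaps, M, N

print(instance_quantities([0.7, 0.65, 0.5], eps=0.1))  # M = {0, 1}, N = {2}
print(binary_kl(0.5, 0.7))
```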

Existing Lower Bounds
For any fixed $\underline{p} \in (0, 1/2)$, Mannor and Tsitsiklis [14] showed that if an algorithm is (ε, δ)-PAC with respect to all instances with $\min_l p_l \ge \underline{p} > 0$, and if $\varepsilon \le \frac{1 - p^*}{4}$ and $\delta \le e^{-8}/8$, then for any constant $\alpha \in (0, 2)$, there exists $c_1 = O(\underline{p}^2)$ (depending on α) such that
$$\mathbb{E}_p[T] \ge c_1 \left( \frac{|\tilde{M}(p, \varepsilon)|}{\varepsilon^2} + \sum_{l \in \tilde{N}(p, \varepsilon)} \frac{1}{\Delta_l^2} \right) \log \frac{1}{8\delta}, \quad (6)$$
where (see (7)-(8)) $\tilde{M}(p, \varepsilon)$ and $\tilde{N}(p, \varepsilon)$ are the subsets of $M(p, \varepsilon)$ and $N(p, \varepsilon)$ restricted to arms satisfying $p_l \ge \frac{\varepsilon + p^*}{2 - \alpha}$. Note that the subsets $\tilde{M}(p, \varepsilon)$ and $\tilde{N}(p, \varepsilon)$ do not always form a partition of the arms, i.e., it may hold that $\tilde{M}(p, \varepsilon) \cup \tilde{N}(p, \varepsilon) \subsetneq [M]$. The sets increase in size as α decreases, but implicitly this leads to a lower value of $c_1$. In addition, as we will see below, the $\underline{p}^2$ dependence entering via $c_1$ is not necessary. We also note that the lower bound in (6) depends on the instance-specific quantities $\tilde{M}(p, \varepsilon)$, $\tilde{N}(p, \varepsilon)$, and $\Delta_l$, and is thus an instance-dependent bound. On the other hand, the lower bound is only stated for (ε, δ)-PAC algorithms, and the PAC guarantee requires the algorithm to eventually succeed on any instance (subject to the assumptions given on $p_l$, ε, and δ).
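As a numerical illustration of (6) (using the restricted sets as reconstructed above; the instance and the values of α are arbitrary examples), the following sketch evaluates the bracketed term of the bound and shows how the participating sets shrink, and the bound weakens, as α grows.

```python
def gap_bound_terms(p, eps, alpha):
    """Evaluate the bracketed term of (6): the restricted sets include only
    arms with p_l >= (eps + p*) / (2 - alpha)."""
    p_star = max(p)
    thresh = (eps + p_star) / (2 - alpha)
    M_t = [l for l in range(len(p)) if p[l] > p_star - eps and p[l] >= thresh]
    N_t = [l for l in range(len(p)) if p[l] <= p_star - eps and p[l] >= thresh]
    return len(M_t) / eps**2 + sum(1.0 / (p_star - p[l])**2 for l in N_t)

p = [0.7, 0.65, 0.5, 0.3]
for alpha in [0.1, 1.0, 1.9]:
    print(alpha, gap_bound_terms(p, eps=0.1, alpha=alpha))
```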

Our Result and Discussion
Our lower bound, stated in the following theorem (Theorem 1), is developed based on Mannor and Tsitsiklis's analysis for best-arm identification [14], but uses novel refinements of the techniques therein to further optimize the bound (see Appendix C for an overview of these refinements).

Theorem 1.
For any bandit instance $p \in (0, p^*]^M$ with $p^* \in (0, 1)$, and any (ε, δ)-PAC algorithm with ε and δ in the stated ranges, we have a lower bound on $\mathbb{E}_p[T]$ in which θ is defined in (14) and $\xi > 0$ is the unique positive solution of the quadratic equation in (15).

Observe that this result matches (11) (with modified constants), and therefore exhibits the above benefit of depending on the full sets $M$ and $N$ without the condition $p_l \ge \frac{\varepsilon + p^*}{2 - \alpha}$ (see (7)-(8)), as well as avoiding the dependence on $\underline{p}$, and permitting the broadest range of ε and δ among the above results.
The result (11) in turn matches (9) whenever the right-hand inequality in (10) is tight (i.e., when $d(p, q) = \Theta\big( \frac{(p - q)^2}{q(1 - q)} \big)$, with p and q representing the two arm means being compared). This is clearly true when p and q are bounded away from zero and one, and also in certain limiting cases approaching these endpoints (e.g., when p and q both tend to one with $\frac{1 - p}{1 - q} = \Theta(1)$). However, there are also limiting cases where the upper bound in (10) is not tight (e.g., $p = 1 - \sqrt{\eta}$ and $q = 1 - \eta$ as $\eta \to 0$), and in such cases, the bound (9) remains tighter than that of Theorem 1.
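This behavior is easy to verify numerically; the following sketch (an illustration with arbitrary parameter choices) compares $d(p, q)$ against $\frac{(p - q)^2}{q(1 - q)}$ in the two limiting regimes just discussed, showing a ratio bounded away from zero in the first and a vanishing ratio in the second.

```python
import math

def d(p, q):
    """Binary KL divergence with natural logarithms (0 < p, q < 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for eta in [1e-2, 1e-4, 1e-6]:
    # Tight regime: p, q -> 1 with (1 - p)/(1 - q) = Theta(1), here p = 1 - 2*eta, q = 1 - eta.
    tight = d(1 - 2 * eta, 1 - eta) / (eta**2 / ((1 - eta) * eta))
    # Loose regime: p = 1 - sqrt(eta), q = 1 - eta.
    loose = d(1 - math.sqrt(eta), 1 - eta) / ((math.sqrt(eta) - eta)**2 / ((1 - eta) * eta))
    print(f"eta={eta:g}: tight-regime ratio={tight:.3f}, loose-regime ratio={loose:.3f}")
```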

Proof of Theorem 1
We follow the general steps of [14, Theorem 5], but with several refinements to improve the final bound. The main differences are outlined in Appendix C.
Step 1: Defining a Hypothesis Test

Let us denote the true (unknown) expected reward of each arm by $Q_l$ for all $l \in [M]$. Similarly to [14,15], we consider $M$ hypotheses as follows: under $H_1$, the arm means are those of the instance itself (i.e., $Q_j = p_j$ for all $j \in [M]$), while for each $l \ne 1$, under $H_l$, the mean of arm $l$ is increased to $Q_l = p_1 + \varepsilon$ and the remaining means are unchanged, making arm $l$ the unique ε-optimal arm. If hypothesis $H_l$ is true, the (ε, δ)-PAC algorithm must return arm $l$ with probability at least $1 - \delta$. We will bound the sample complexity when the hypothesis $H_l$ is true. We denote by $\mathbb{E}_l$ and $\mathbb{P}_l$ the expectation and probability, respectively, under hypothesis $H_l$. Let $B_l$ be the event that the algorithm returns arm $l$.
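The construction of these hypotheses can be summarized programmatically as follows (a sketch of the construction described above; it assumes $p_1 + \varepsilon < 1$ so that the perturbed mean remains a valid Bernoulli parameter).

```python
def hypothesis_instances(p, eps):
    """Mean vectors under H_1, ..., H_M: H_1 is the instance itself, and H_l
    (l >= 2) modifies only arm l, raising its mean to p_1 + eps so that arm l
    becomes the unique eps-optimal arm (assumes p_1 + eps < 1)."""
    p1 = max(p)
    instances = [list(p)]  # H_1: unchanged
    for l in range(1, len(p)):
        q = list(p)
        q[l] = p1 + eps
        instances.append(q)  # H_{l+1}
    return instances

for h, q in enumerate(hypothesis_instances([0.7, 0.6, 0.5], eps=0.1), start=1):
    print(f"H_{h}: {q}")
```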
From the (ε, δ)-PAC requirement above, it follows that (18) holds. We define the set $T(p, \varepsilon)$ in (20), as well as the event in (21), which is the event that the policy eventually selects an arm in the ε-neighborhood of the best arm in $[M]$. Since the policy is (ε, δ)-correct with $\delta < \delta_0$, we must have (22), and it follows from (18) and (22) that (23) holds for all $l \in T(p, \varepsilon)$.
Step 2: Bounding the Number of Pulls of Each Arm

Before proceeding, we make some additional definitions in (25)-(28). The definitions (27) and (28) will only be used for arms with $\frac{\varepsilon + \Delta_l}{p_l} \le \frac{1}{2}$, and for such arms, we will establish in the analysis that $\bar{\alpha}_l \ge 0$ and $\bar{\beta}_l \ge 0$.
We prove the following lemma, characterizing the probability of a certain event in which (i) the number of pulls of some arm $l \in T(p, \varepsilon)$ falls below a suitable threshold (event $A_l$ below), (ii) a deviation bound holds regarding the number of observed 1's from pulling arm $l$ (event $C_l$ below), and (iii) arm $l$ is not returned (event $B_l^c$).

Lemma 1.
For each $l \in [M]$, let $T_l$ be the total number of times that arm $l$ is pulled under the (ε, δ)-correct policy. Let $K_l = X_l^{(1)} + X_l^{(2)} + \cdots + X_l^{(T_l)}$ be the total number of unit rewards obtained from pulling arm $l$ up to the $T_l$-th time. Let $G_{1,l}$ and $G_{2,l}$ be as defined in (29)-(30), where $\alpha_l$, $\bar{\alpha}_l$, and $\bar{\beta}_l$ are defined in (25), (27), and (28), respectively. Let $\nu_l$ be as defined in (31), where ξ is defined in (15). Define the events $A_l$ and $C_l$ in (32) and (33), respectively. If $l \in T(p, \varepsilon)$ (see (20)), then under the condition that $\mathbb{E}[T_l]$ is not too large, we have the bound in (36).

Proof. See Appendix A.
Intuitively, $A_l$ is the event that the total number of times that arm $l$ is pulled is small, and $C_l$ is the event that $|p_l T_l - K_l|$ is not too large (since pulling an arm $T_l$ times should produce roughly $p_l T_l$ ones). The lemma indicates that if $\mathbb{E}[T_l]$ is not too large, then $\mathbb{P}[A_l \cap B_l^c \cap C_l]$ is lower bounded, and this will ultimately lead to a lower bound on $\mathbb{P}[B_l^c]$, where $B_l^c$ is the event of primary interest. In Lemma 2 below, we will use Lemma 1 to deduce a lower bound on $\mathbb{E}_1[G_{1,l}]$, which amounts to a lower bound on the average number of arm pulls by the definition of $G_{1,l}$. Before doing so, we introduce a likelihood ratio that will be used in a change-of-measure argument [14].
For any given time $t \ge 1$ and $l \in [M]$, let $T_l(t)$ be the total number of times that arm $l$ is pulled by time $t$, and let $\mathcal{F}_t$ be the σ-algebra generated by the rewards observed up to time $t$, for all $t = 1, 2, \ldots$. Recall that $T$ is the stopping time of the algorithm, and that $T_l := T_l(T)$ for all $l \in [M]$. Moreover, let $W = \mathcal{F}_T$ be the entire history up to the stopping time $T$. We define the following likelihood ratio:
$$L_l(w) = \frac{\mathbb{P}_l[W = w]}{\mathbb{P}_1[W = w]}$$
for every possible history $w$. Moreover, we let $L_l(W)$ denote the corresponding random variable. Given the history up to time $T - 1$ (i.e., $\mathcal{F}_{T-1}$), the arm reward at time $T$ has the same probability distribution under $H_1$ and $H_l$ unless the chosen arm is arm $l$. Therefore, we have
$$L_l(W) = \left( \frac{p_1 + \varepsilon}{p_l} \right)^{K_l} \left( \frac{1 - p_1 - \varepsilon}{1 - p_l} \right)^{T_l - K_l},$$
where $K_l := X_l^{(1)} + X_l^{(2)} + \cdots + X_l^{(T_l)}$ (i.e., the total number of 1's in the $T_l$ pulls of arm $l$). The following proposition presents one of our key technical results towards establishing the lower bound. We use the definitions in (1)-(5), along with (25)-(28).

Proof. See Appendix B.
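As an aside, since only the pulls of arm $l$ contribute to $L_l(W)$, the likelihood ratio defined above depends on the history only through the pair $(T_l, K_l)$; the following sketch (our own illustration, with $q_l = p_1 + \varepsilon$ denoting arm $l$'s mean under $H_l$) computes its logarithm.

```python
import math

def log_likelihood_ratio(T_l, K_l, p_l, q_l):
    """log L_l(W): log-likelihood of arm l's rewards under H_l (mean q_l)
    minus that under H_1 (mean p_l), where K_l of the T_l pulls returned 1.
    Pulls of other arms cancel, since their distributions coincide."""
    return K_l * math.log(q_l / p_l) + (T_l - K_l) * math.log((1 - q_l) / (1 - p_l))

# Example: 100 pulls of an arm with p_l = 0.5 under H_1 and q_l = 0.7 under H_l.
print(log_likelihood_ratio(T_l=100, K_l=55, p_l=0.5, q_l=0.7))
```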
Based on Lemma 1 and Proposition 1, we obtain the following extension of [14, Lemma 6] (Lemma 2 below), lower bounding the average of each $G_{1,l}$; this lower bound will later be translated into a lower bound on the number of arm pulls $T_l$.
Proof. We use a proof by contradiction: assume the contrary; then by Lemma 1, Equation (36) holds. Moreover, by Proposition 1, we have (44), and recalling the definition of $S_l$ in (34), it follows from (44) that (45)-(47) hold, where (46) follows from the definitions in (32)-(33), and (47) follows from the fact that, by the choice of $\nu_l > 0$ given in (31), the inequality (48) holds. Hence, from (47) and (48), we have (49) for all $l \in T(p, \varepsilon)$, by the definition of θ in (14).
We are now ready to complete the proof via the chain of steps (50)-(56), where (50) follows from (43).
The inequality (56) contradicts the fact that, under $H_l$, the (ε, δ)-correct bandit policy must return arm $l$ with probability at least $1 - \delta$, i.e., that $\mathbb{P}_l(B_l^c) \le \delta$. This concludes the proof.
From Lemma 2 and the definition of $G_{1,l}$ in (29), a lower bound on $\mathbb{E}_1[T_l]$ holds for all $l \in T(p, \varepsilon)$; hence, using the definition of $\nu_l$ in (31), we obtain an explicit form of this bound.

Step 3: Deducing a Lower Bound on the Sample Complexity

For any arm $l \in T(p, \varepsilon)$, by the definition of $\alpha_l$ in (25), we have (60). Note that $0 \le \Delta_l < \varepsilon$ for all $l \in M_0(p, \varepsilon)$, since $M_0 \subseteq M$, the set of ε-optimal arms. Therefore, we can further simplify (60) for $l \in M_0(p, \varepsilon)$.

Conclusion
We have presented a refined analysis of best-arm identification following the gap-based approach of [14], but incorporating refinements that circumvent some weaknesses, leading to a bound matching the divergence-based approach [15] in many cases. It would be of interest to determine whether further refinements could allow this approach to match [15] in all cases, or the extent to which the gap-based approach extends beyond Bernoulli rewards and/or beyond the standard best-arm identification problem (e.g., to ranking problems [21]).
Appendix A: Proof of Lemma 1

We start by simplifying the main assumption of the lemma, where (A3) follows from Markov's inequality, and (A4) follows from the definition of $A_l$ in (32). It follows from (A4) that (A5) holds. We define the set $V$ in (A6), and we will find it convenient to treat the cases $l \in V$ and $l \notin V$ separately.

For $l \notin V$, from (30) and (33), we have $A_l \cap C_l = A_l$ since $\theta \in (0, 1)$, $\xi > 0$, and $G_{2,l} = 0$, and the desired result immediately follows from (A5).

On the other hand, for $l \in V$, we can simplify the definition of $\bar{\alpha}_l$ in (27), where (A10) follows from (25), and (A11) follows from the definition of the set $V$ in (A6). It follows that (A15) holds for all $l \in V$, where the second inequality in (A15) follows from $p_l \le p^* \le p^* + \varepsilon$ and the definitions of $\alpha_l$ and $\beta_l$ in (25) and (26), respectively. Similarly, for $l \in V$, we can simplify $\bar{\beta}_l$ from (28), where (A17) follows from (A15), (A18) follows from (27), and (A19) again uses (A15). It follows that (A20) holds for all $l \in V$.

Now, let
$$Z_l^{(j)} := X_l^{(j)} - p_l, \qquad j = 1, 2, \cdots \quad \text{(A21)}$$
Then, we have (A22). In addition, we note that $Z_l^{(1)}, Z_l^{(2)}, \cdots$ form an i.i.d. sequence by the i.i.d. property of $X_l^{(1)}, X_l^{(2)}, \cdots$. For each positive integer $t_l$, let $K_{l,t_l} := \sum_{j=1}^{t_l} X_l^{(j)}$, and define $U_l$ and $V_l(t_l)$ in (A23)-(A24). Observe that (A25)-(A28) hold, where (A28) follows since a Bernoulli(ρ) variable has variance $\rho(1 - \rho)$.
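To give a feel for the role of the variance computation in (A28), the following sketch (an illustration with arbitrary example parameters, not part of the proof) estimates a deviation probability of the form $\mathbb{P}(|p_l t - K_{l,t}| \ge c)$ by Monte Carlo and compares it with the Chebyshev bound $t\, p_l (1 - p_l)/c^2$ that the i.i.d. structure of the $Z_l^{(j)}$ makes available.

```python
import random

def deviation_prob(p_l, t, c, trials=20000, seed=1):
    """Monte Carlo estimate of P(|p_l * t - K_{l,t}| >= c), where K_{l,t} is
    a sum of t i.i.d. Bernoulli(p_l) rewards."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        k = sum(1 for _ in range(t) if rng.random() < p_l)
        hits += abs(p_l * t - k) >= c
    return hits / trials

p_l, t, c = 0.4, 400, 30
print("empirical:", deviation_prob(p_l, t, c))
print("Chebyshev bound:", t * p_l * (1 - p_l) / c**2)  # Var(K_{l,t}) = t p_l (1 - p_l)
```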
We are now ready to upper bound $\mathbb{P}_1[C_l^c \cap A_l]$ for $l \in V$, where:
• (A30) uses the definitions of $C_l$ and $G_{2,l}$;
• (A32) follows from (A15) and (A20), along with $\mathbf{1}\{p_l T_l > K_l\} + \mathbf{1}\{p_l T_l \le K_l\} = 1$;
• (A33) uses the definitions of $U_l$ and $A_l$;
• (A34) follows from the definitions of $U_l$ and $V_l(t_l)$ in (A23) and (A24) (which imply $U_l = V_l(T_l)$);
• (A35) follows from (A26).
Defining $n_l = 2(p^* + \varepsilon)$