Randomized allocation with arm elimination in a bandit problem with covariates

Motivated by applications in personalized web services and clinical research, we consider a multi-armed bandit problem in a setting where the mean reward of each arm is associated with covariates. A multi-stage randomized allocation with arm elimination algorithm is proposed to combine flexibility in reward function modeling with a theoretical guarantee of the cumulative regret minimax rate. When the function smoothness parameter is unknown, the algorithm is equipped with a histogram-based smoothness parameter selector using Lepski's method, and is shown to maintain the regret minimax rate up to a logarithmic factor under a "self-similarity" condition.


Introduction
The multi-armed bandit problem is an optimization game with promising applications in, e.g., web services and clinical research. Under a prototypical framework, a bandit problem consists of several gambling machines, and the underlying reward distribution of each machine is unknown to the game player. Each time, the player can pull only one of the machine arms to receive reward. Given a finite number of times to play the machines, the goal is to devise a sequential arm allocation algorithm to maximize the cumulative reward, and equivalently, to minimize the cumulative regret (the shortfall of the reward of the algorithm compared to an oracle). A balance between exploration and exploitation is usually required for a bandit problem algorithm.
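To make the exploration-exploitation trade-off concrete, the following minimal sketch (in Python, with hypothetical arm means and noise level; none of these specifics come from the formal setup below) simulates a covariate-free bandit played by an ε-greedy rule and records its cumulative regret against an oracle that always pulls the best arm.

```python
import random

def play_bandit(means, horizon, eps=0.1, seed=0):
    """Simulate a standard (covariate-free) bandit with an epsilon-greedy player.

    `means` are the unknown arm means; the player only observes noisy pulls.
    Returns the cumulative regret against an oracle pulling the best arm.
    """
    rng = random.Random(seed)
    l = len(means)
    counts = [0] * l
    sums = [0.0] * l
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        if rng.random() < eps or min(counts) == 0:
            i = rng.randrange(l)                                   # explore
        else:
            i = max(range(l), key=lambda a: sums[a] / counts[a])   # exploit
        reward = means[i] + rng.gauss(0.0, 0.1)                    # noisy reward
        counts[i] += 1
        sums[i] += reward
        regret += best - means[i]                                  # oracle shortfall
    return regret
```

Setting `eps = 1` recovers pure random pulls (no exploitation), while very small `eps` under-explores; controlling this trade-off with guarantees is exactly what the algorithms discussed below are designed to do.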
The standard setting of a bandit problem assumes that the reward response of each arm is "homogeneous" with no available covariates. Since the seminal work of Robbins (1952), the standard bandit problem has been studied extensively; representative early work includes Lai and Robbins (1985), Berry and Fristedt (1985), Gittins (1989) and Auer, Cesa-Bianchi and Fischer (2002). See also Cesa-Bianchi and Lugosi (2006) and Bubeck and Cesa-Bianchi (2012) for reviews of its various extensions. The "homogeneity" assumption of the standard setting, however, can be too restrictive in real applications. An increasingly popular but much less studied setting assumes that the mean reward is associated with covariates: the game player is given a d-dimensional covariate x ∈ R^d as additional information before deciding which arm to pull, and the expected reward of a bandit arm given covariate x takes a functional form f(x). Such a variant of the bandit problem is called the multi-armed bandit problem with covariates, or MABC for short.
The MABC problem first appears under a parametric framework in Woodroofe (1979). Attracted by promising applications in personalized web and medical services, increasing attention has been directed to the MABC problem in recent years. For example, with settings more flexible than that of Woodroofe (1979), a linear response bandit problem has recently been studied under a minimax framework with margin conditions (Goldenshluger and Zeevi, 2009; Goldenshluger and Zeevi, 2013, and references therein). The well-known upper confidence bound (UCB) type algorithms have also been extended to linear parametric settings, and are studied empirically in, e.g., Li et al. (2010).
The MABC problem from a nonparametric perspective is initiated by Yang and Zhu (2002). They propose a randomized allocation algorithm with histogram and K-nearest neighbor methods, the cumulative reward of which is shown to be asymptotically equivalent to that of an oracle. Although it is a very flexible and often effective algorithm, a finite-time regret analysis by Qian and Yang (2016) suggests that it may converge sub-optimally in terms of the minimax rate of the regret established by Rigollet and Zeevi (2010) due to its over-exploration in the randomization process. Perchet and Rigollet (2013) propose algorithms with an important step of arm elimination that originally appeared in a standard bandit problem setting (Even-Dar, Mannor and Mansour, 2006). They provide more rigorous and fine-tuned arguments for the standard setting, and further obtain performance bounds for their arm elimination algorithms devised to deal with the MABC problem. In particular, by a dyadic binning process, their adaptively binned successive elimination (ABSE) algorithm achieves the regret minimax rate, and is adaptive to a margin condition. The aforementioned nonparametric MABC algorithms, however, all assume a known Hölder smoothness condition on the mean reward functions. It is of interest to find algorithms that are adaptive to both the smoothness and the margin conditions.
Other settings of MABC problems have been studied in, e.g., Langford and Zhang (2007) and Dudik et al. (2011), where algorithms are designed to target the performance of the best arm-pulling policy among a class of finitely many candidate policies. May et al. (2012) study MABC from a Bayesian perspective.
In addition, differently from the MABC problem, a related setting considers the arm space (with possibly infinitely many arms) instead of the covariate space (see, e.g., Dani, Hayes and Kakade, 2008; Rusmevichientong and Tsitsiklis, 2010; Auer, Ortner and Szepesvári, 2007; Kleinberg, Slivkins and Upfal, 2007). The bandit problem that considers the joint covariate and arm space is studied in Lu, Pál and Pál (2010) and Slivkins (2011).
In this article, we follow the line of the nonparametric MABC problem. The primary task is to address the question of whether we can achieve a near minimax optimal regret upper bound without prior knowledge of the smoothness parameter. Our solution to this question is closely related to the adaptive nonparametric estimation technique pioneered by Lepski (1990). The Lepski-type method has recently been studied in Giné and Nickl (2010), Hoffmann and Nickl (2011) and Bull (2012), where a "self-similarity" condition is used for establishing adaptive confidence bands in both density estimation and regression problems. As the most important contribution of this work, we propose the strategy of integrating Lepski's method with a nonparametric MABC algorithm, and show that under a "self-similarity" condition, the resulting cumulative regret can adaptively achieve the minimax rate up to a logarithmic factor. In particular, the ABSE algorithm (Perchet and Rigollet, 2013) can be used to adaptively achieve a near minimax rate when equipped with the Lepski-type smoothness parameter selector (see Remark 5.1).
It is noted that the regret minimization in the MABC problem differs from the usual purpose of nonparametric function estimation, but shares the difficulties involved in establishing adaptive confidence bands. A more detailed discussion regarding the connection of the adaptive nonparametric estimation with the MABC problem is deferred to section 6.
We present the proposed strategy using a nonparametric MABC algorithm called randomized allocation with arm elimination (or RAAE for abbreviation). Motivated by the observation in Qian and Yang (2016) that using randomized allocation strategy alone may give sub-optimal rate for the cumulative regret, the RAAE algorithm is proposed to embed the key arm-elimination technique developed in Perchet and Rigollet (2013) into the randomized allocation and can be shown to achieve the same minimax rate as the ABSE (with known smoothness). In our view, the feature of randomized allocation procedure (in addition to arm elimination) is practically useful because it provides a user with additional flexibility of applying a regression modeling method (e.g., kernel regression) for each arm to further exploit the response-covariate association. The practical implications of the randomized allocation step in RAAE are discussed in Remark 3.1 and are numerically illustrated in Appendix B with simulation examples.
The remainder of this article is organized as follows. The MABC problem setup is introduced in section 2. The RAAE algorithm and the integrated smoothness parameter selector are described in sections 3 and 4, respectively. The finite-time regret analysis is done in section 5. A final discussion is given in section 6. The technical lemmas and proofs are given in Appendix A and a simulation experiment regarding the randomized allocation in RAAE is shown in Appendix B.

Problem setup
Consider an l-armed bandit problem (l ≥ 2) and suppose the covariates take values in the hypercube [0, 1] d . Let f i (x) denote the (conditional) mean reward function of an arm i (1 ≤ i ≤ l) given a covariate x. We model the observed reward as f i (x) + ε, where ε is the random error with mean 0. The mean reward functions and the random error distributions are unknown.
Let {X_n, n ≥ 1} be a sequence of independent covariates with an unknown probability distribution P_X supported on [0, 1]^d. Given any time point n (n ≥ 1), let Y_{i,n} denote the observed reward from pulling arm i (1 ≤ i ≤ l), and let I_n denote the arm chosen by a sequential allocation rule η. The MABC problem works as follows at each time point n. First, the covariate X_n is observed. Based on X_n and the previous observations (X_j, I_j, Y_{I_j,j}), 1 ≤ j ≤ n − 1, the allocation rule η is applied to decide which arm to pull. Then, the game player pulls the chosen arm I_n and receives the corresponding reward Y_{I_n,n}. The received reward is generated by Y_{I_n,n} = f_{I_n}(X_n) + ε_n, where ε_n is the random error, and (X_n, ε_n) is independent of the previous observations. We assume the covariate and the random error satisfy the following conditions.

Assumption 2.1. The design distribution of the covariate is dominated by the Lebesgue measure with a continuous density p(x) uniformly bounded above and away from 0 on [0, 1]^d.

Assumption 2.2. The errors satisfy a (conditional) moment condition: there exist positive constants v and c such that for all integers k ≥ 2 and n ≥ 1, E(|ε_n|^k | X_n) ≤ k! v^2 c^{k−2}/2.

Assumption 2.1 is used by the smoothness parameter selector to ensure that the histogram estimator is uniformly close to the true reward function. Assumption 2.2 is a (conditional) moment assumption known as the refined Bernstein condition (e.g., Birgé and Massart, 1998). Note that under Assumption 2.2, the random error can be dependent on the covariate, and is not necessarily bounded. When the response is bounded (e.g., binary), Assumption 2.2 trivially holds. In general, it is satisfied if the error has a finite exponential moment, and thus allows error distributions with tails heavier than the normal distribution.
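For concreteness, the data-generating process just described can be sketched as follows (Python; the particular reward functions f1, f2 and the bounded noise below are hypothetical choices consistent with Assumptions 2.1 and 2.2, not functions from the paper). In the actual game only the reward of the pulled arm is observed; the sketch returns both arms' rewards purely for inspection.

```python
import random

# Hypothetical mean reward functions f_i on [0,1] (d = 1) for illustration;
# the paper leaves the f_i unspecified beyond the Hölder condition.
def f1(x): return 0.5 + 0.3 * x
def f2(x): return 0.8 - 0.3 * x

def draw_round(rng):
    """One round of the MABC data-generating process: a covariate X_n ~ P_X
    (here Uniform[0,1], which satisfies Assumption 2.1), then the reward of
    arm i would be Y = f_i(X_n) + eps with bounded noise, so the refined
    Bernstein condition of Assumption 2.2 holds trivially."""
    x = rng.random()
    eps = rng.uniform(-0.1, 0.1)   # bounded error term
    return x, [f1(x) + eps, f2(x) + eps]
```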
Define, at given x, i*(x) = argmax_{1≤i≤l} f_i(x) to be the best arm, f*(x) = f_{i*(x)}(x) to be the best mean reward, and let w = sup_{1≤i≤l} sup_{x∈[0,1]^d} (f*(x) − f_i(x)) be the largest reward gap. We measure the performance of an allocation rule η by the cumulative regret R_n(η), the per-round regret r_n(η) and the inferior sampling rate q_n(η), defined by

R_n(η) = E Σ_{j=1}^n (f*(X_j) − f_{I_j}(X_j)), r_n(η) = R_n(η)/n, and q_n(η) = (1/n) Σ_{j=1}^n P(I_j ≠ i*(X_j)),

respectively. Next, we introduce a Hölder smoothness condition and a margin condition, both of which have been studied in the context of nonparametric estimation (Audibert and Tsybakov, 2005; Audibert and Tsybakov, 2007) and classification (Mammen and Tsybakov, 1999; Tsybakov, 2004). Let ‖·‖ be the sup-norm on d-dimensional vectors. Suppose κ_* and κ^* are two known constants satisfying 0 < κ_* < κ^* ≤ 1. Given κ ∈ [κ_*, κ^*] and ρ > 0, define Σ(κ, ρ) to be the class of functions satisfying the following Hölder smoothness condition: |f(x_1) − f(x_2)| ≤ ρ‖x_1 − x_2‖^κ for every x_1, x_2 ∈ [0, 1]^d. As mentioned in the introduction, to our knowledge, existing nonparametric MABC algorithms all require the knowledge of κ for optimal properties. However, such information is typically not available to the game player. Efforts are made to provide a proper estimate of κ in section 4. The margin condition below has also been used in the MABC problem to control the game complexity (Goldenshluger and Zeevi, 2009; Perchet and Rigollet, 2013).
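For a realized (deterministic) arm sequence, the expectation in R_n and the probability in q_n reduce to empirical sums over the played rounds. The sketch below mirrors the three definitions in that case; the reward functions passed in the usage are hypothetical.

```python
def regrets(xs, pulls, fs):
    """Cumulative regret R_n, per-round regret r_n = R_n/n, and inferior
    sampling rate q_n for a played sequence, following the definitions above:
    R_n = sum_j [f*(X_j) - f_{I_j}(X_j)], q_n = (1/n) * #{j : I_j != i*(X_j)}."""
    R, wrong = 0.0, 0
    for x, i in zip(xs, pulls):
        vals = [f(x) for f in fs]
        star = max(range(len(fs)), key=vals.__getitem__)  # best arm i*(x)
        R += vals[star] - vals[i]                         # instantaneous regret
        wrong += int(i != star)                           # inferior pull indicator
    n = len(xs)
    return R, R / n, wrong / n
```

For example, with two hypothetical arms f_1(x) = x and f_2(x) = 1 − x, pulling the correct arm at x = 0 and x = 1 yields zero regret, while pulling the wrong arm both times yields cumulative regret 2 and inferior sampling rate 1.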
Assumption 2.3. There exist α ∈ (0, d/κ], t_0 ∈ (0, 1) and c_0 > 0 such that for every 1 ≤ i ≤ l, P_X(0 < f*(X) − f_i(X) ≤ t) ≤ c_0 t^α for all t ∈ (0, t_0].

A larger α in Assumption 2.3 indicates an easier MABC game in the sense that, except on a subset of the domain with a small P_X-probability, either all the mean rewards are the same for all arms, or the optimal mean reward is well separated from the sub-optimal ones. In particular, when α > d/κ, one arm dominates over the entire domain (Perchet and Rigollet, 2013, Proposition 3.1) and standard bandit problem algorithms suffice in this case. Since this simple situation is not the interest of this article, we assume that α ≤ d/κ.
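The margin condition can be checked numerically for a given instance. The sketch below estimates P_X(0 < f*(X) − f_i(X) ≤ t) by Monte Carlo for a univariate uniform covariate; the two-arm example in the comment (f_1(x) = x, f_2(x) = 0) is hypothetical, not from the paper, and satisfies the condition with α = 1 and c_0 = 1.

```python
import random

def margin_prob(gap, t, n=20000, seed=0):
    """Monte Carlo estimate of P_X(0 < f*(X) - f_i(X) <= t) for X ~ Uniform[0,1]
    and a given gap function x -> f*(x) - f_i(x); used to probe Assumption 2.3."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        g = gap(rng.random())
        hits += int(0 < g <= t)
    return hits / n

# Example: two arms f_1(x) = x and f_2(x) = 0 on [0,1]; the gap of the
# suboptimal arm is x for x > 0, so P(0 < gap <= t) = t and the margin
# condition holds with alpha = 1, c_0 = 1 (a hypothetical instance).
```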
Next, we want to devise an algorithm that does not rely on the knowledge of either κ or α, but still achieves the (nearly) optimal cumulative regret rate as if we knew them in advance.

Algorithm
The algorithm consists of a forced sampling step followed by a randomized allocation with arm elimination mechanism. Suppose N is the total time horizon. The algorithm starts with a forced sampling step, in which every arm is pulled n_0 times (1 ≤ n_0 ≪ N). The random sample of each arm thus obtained feeds into a smoothness parameter selector, whose output is subsequently used to choose related parameters of the remaining steps. After the forced sampling step, the remaining time horizon is divided into T + 1 stages. Let Ñ_1 < Ñ_2 < · · · < Ñ_T be the end time points of the first T stages, and define Ñ_0 = n_0 l, so that the number of time points in stage t is N_t = Ñ_t − Ñ_{t−1}. Each stage t is associated with a partition B_t of [0, 1]^d into bins of a dyadic bin width h_t, with the bin widths decreasing over stages. By the choice of the bin width sequence, we can see that for each bin B ∈ B_t (1 ≤ t ≤ T) and each stage s (0 ≤ s < t), there is a unique (larger) bin B′ ∈ B_s that contains B. We denote B′ by p_s(B) and call it the "parent" bin of B at stage s. Let {π_n, 1 ≤ n ≤ N} be a sequence of positive numbers satisfying (l − 1)π_n < 1 for every 1 ≤ n ≤ N. The algorithm for MABC works as follows.
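The dyadic bin structure and the parent-bin map p_s(B) can be sketched as follows (Python, arbitrary d; clamping boundary points into the last bin is an implementation choice of this sketch, not specified by the paper).

```python
def bin_index(x, h):
    """Index of the bin of width h containing the point x in [0,1]^d
    (x given as a tuple); boundary points are clamped into the last bin."""
    m = int(round(1 / h))
    return tuple(min(int(xi / h), m - 1) for xi in x)

def parent(idx, h_child, h_parent):
    """Parent bin at the coarser width h_parent containing the child bin:
    with dyadic widths (h = 2^-k), each bin of width h_child sits inside a
    unique bin of width h_parent > h_child, namely p_s(B)."""
    r = int(round(h_parent / h_child))   # children per parent, per dimension
    return tuple(i // r for i in idx)
```

For instance, the point (0.3, 0.7) lies in bin (1, 2) at width 0.25, and that bin's parent at width 0.5 is bin (0, 1), which indeed contains the point.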
Step 0. Initialize the game with the forced sampling step.
Step 0.1. Obtain a random sample of each arm by pulling each arm n 0 times.
Step 0.2. If the smoothness parameter κ is unknown, for every given arm i (1 ≤ i ≤ l), estimate κ by the smoothness parameter selector described in section 4. The resulting estimate for arm i is denoted by κ̂^(i). Define κ̂_* = min_{1≤i≤l} κ̂^(i), which is used to determine parameters of the following steps. If κ is known, simply set κ̂_* = κ.
Step 1. Define the initial set of active arms in each bin B to be S_B = {1, 2, · · · , l}. At each time point n of the current stage t, do Steps 1.1-1.3.
Step 1.1. Observe the covariate X_n and locate the bin with bin width h_{t−1} that contains X_n, denoted by B = B_{t−1}(X_n). Find S_B, the set of active arms in bin B, and denote the number of arms in S_B by l_B.
Step 1.2. For each arm i ∈ S_B, based on the previously obtained sample of covariates and rewards, estimate the mean reward f_i(X_n) by some user-specified regression modeling method (e.g., kernel regression). The estimator is denoted by f̂_{i,n}(X_n).
Step 1.3. Estimate the best arm, select and pull. Define î_n = argmax_{i∈S_B} f̂_{i,n}(X_n) (if there is a tie, any tie-breaking rule may apply). Choose the arm î_n (the currently most promising choice) with probability 1 − (l_B − 1)π_n, and each of the remaining arms in S_B with probability π_n. That is, P(I_n = î_n) = 1 − (l_B − 1)π_n and P(I_n = i) = π_n for each i ∈ S_B \ {î_n}. Then pull the arm I_n to receive the reward Y_{I_n,n}.
Step 2. At the end of stage t, perform arm elimination for the bins in B t (with bin width h t ). For each bin B ∈ B t , do the following substeps.
Step 2.1. Identify the parent bin B′ = p_{t−1}(B) and the set of active arms S_{B′} for bin B′.

Step 2.2. Calculate the sample average Ȳ_{B,i} of the rewards received from each arm i ∈ S_{B′} at the time points of stage t whose covariates fall in bin B.

Step 2.3. Identify the set of "bad" arms to be eliminated, namely {i ∈ S_{B′} : max_{j∈S_{B′}} Ȳ_{B,j} − Ȳ_{B,i} > α_t}, where α_t is a stage-dependent threshold. Obtain the set of active arms S_B in bin B for the next stage by eliminating these "bad" arms from S_{B′}.

Step 3. Repeat Step 1 and Step 2 for stages t = 2, 3, · · · , T.
The forced sampling step obtains a random sample of each arm for the smoothness parameter selector. After the forced sampling step, T + 1 stages of randomized allocation with arm elimination follow. For a given stage t (1 ≤ t ≤ T + 1), Step 1 performs the randomized arm allocation. Specifically, Step 1.1 retrieves the set of active arms inherited from the previous stage. In particular, for stage t = 1, the set of active arms includes all the candidate arms. In Step 1.2, we have the flexibility to choose proper regression methods to estimate the mean reward functions of the active arms; both parametric and nonparametric methods may apply. Step 1.3 is the randomized allocation that favors the arm with the highest estimated reward and selects this arm with high probability. At the end of a given stage t (1 ≤ t ≤ T), Step 2 follows to identify and eliminate the clearly bad-performing arms so that they do not get pulled in the next stage. For this purpose, the covariate domain is divided into h_t^{−d} bins with bin width h_t. For each of these bins, Step 2.2 calculates the reward sample average of each active arm during stage t. Subsequently, Step 2.3 eliminates the arms whose sample averages are low compared to the highest. The remaining arms of each bin after elimination serve as the new active arms, and the next stage follows. Heuristically speaking, Step 2 assists the randomized allocation mechanism of Step 1 in decreasing the number of times the bad-performing arms get selected. The choice of the algorithm parameters, including n_0, T, Ñ_t and α_t, depends on κ̂_*, and is described in section 5. Note also that the algorithm above implicitly assumes that N > Ñ_T. If Ñ_T is chosen such that N < Ñ_T, we simply stop the algorithm at n = N.
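The elimination rule of Steps 2.2-2.3 within a single bin can be sketched as follows (Python; how to treat an arm with no observations in the bin is an assumption of this sketch, since that corner case is not spelled out here).

```python
def eliminate(avgs, counts, alpha_t):
    """Step 2.3 sketch for one bin: drop every active arm whose stage-t
    sample average trails the best sample average by more than alpha_t.
    `avgs[i]` and `counts[i]` are arm i's average reward and pull count
    in the bin during the stage; unobserved arms are kept (an assumption)."""
    sampled = [i for i in avgs if counts[i] > 0]
    if not sampled:
        return list(avgs)              # nothing observed: keep all arms active
    top = max(avgs[i] for i in sampled)
    return [i for i in avgs if counts[i] == 0 or top - avgs[i] <= alpha_t]
```

With averages {1: 0.9, 2: 0.5, 3: 0.85} and threshold α_t = 0.1, arm 2 trails the leader by 0.4 and is eliminated, while arms 1 and 3 stay active for the next stage.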
Remark 3.1. Here, we provide some detailed discussion regarding the practical relevance of the randomized allocation procedure in Steps 1.2-1.3. From the perspective of minimax optimality, under our settings and with the current technical tools available, if the π_n's are uniformly lower bounded by a positive constant (and upper bounded by 1/l_B due to the natural requirement of randomized allocation), the RAAE algorithm can achieve the minimax regret rate of Rigollet and Zeevi (2010), irrespective of the regression modeling method chosen by the user. In particular, if we choose π_n = 1/l_B (that is, each active arm has an equal chance of being pulled), then the information from Step 1.2 is effectively ignored and the RAAE algorithm essentially becomes analogous to ABSE in the sense that both algorithms tend to pull each active arm an equal number of times. Practically, we advocate the use of smaller π_n to take advantage of the additional information gained from Step 1.2. For example, we may use kernel regression in Step 1.2 to estimate the reward function of each active arm. Then, in Step 1.3, if we choose π_n = 0.05 ∧ 1/l_B, the arm with the highest estimated reward from Step 1.2 is pulled with a larger probability than the other active arms in the randomized allocation (assuming l_B < 20). Our empirical experience favors the latter choice of π_n. Simulation examples are given in Appendix B comparing the two different choices of π_n, with kernel regression as the user-specified regression modeling method.
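The randomized allocation of Step 1.3 amounts to the following sampler (a Python sketch; `estimates` stands in for the fitted values f̂_{i,n}(X_n) produced by the user-specified regression method of Step 1.2).

```python
import random

def allocate(estimates, active, pi_n, rng):
    """Step 1.3 sketch: pull the arm with the highest estimated reward with
    probability 1 - (l_B - 1) * pi_n, and each other active arm with
    probability pi_n.  Requires (l_B - 1) * pi_n < 1."""
    best = max(active, key=lambda i: estimates[i])
    others = [i for i in active if i != best]
    u = rng.random()
    if u < len(others) * pi_n:
        return others[int(u / pi_n)]   # u lands uniformly on one of the others
    return best
```

With three active arms and π_n = 0.1, the currently best arm is pulled with probability 0.8 and each other arm with probability 0.1, which matches the long-run pull frequencies of the sampler.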

Smoothness parameter selector
Suppose f(x) is the mean reward function of a given arm, and a random sample {(X_i, Y_i), i = 1, · · · , n} of this arm is observed during the forced sampling step. Recall that κ_* and κ^* (0 < κ_* < κ^* ≤ 1) are the known lower and upper bounds of κ, respectively.
First, we make the following definitions. Define two integers τ_* = min{τ ∈ N : 2^{−τ} ≤ n^{−1/(2κ^*+d)}} and τ^* = max{τ ∈ N : 2^{−τ} ≥ n^{−1/(2κ_*+d)}}. For any τ ∈ N, define u_τ = 2^{−τ}, and let κ_τ be the real number that satisfies u_τ = n^{−1/(2κ_τ+d)}. Then it is not hard to see that there exists a constant Δ > 0 such that κ_τ − κ_{τ+1} ≤ Δ/log n for any τ ∈ [τ_*, τ^*]. Given τ, we evenly partition the domain into 1/u_τ^d bins with bin width u_τ, and let D_τ(x) denote the bin that contains x ∈ [0, 1]^d.
Next, for any given x ∈ [0, 1]^d and τ ∈ N, we can define a histogram estimator θ̂_τ(x) of f(x) as the average of the responses Y_i whose covariates X_i fall in the bin D_τ(x). The selector then chooses

τ̂ = min{τ_* ≤ τ ≤ τ^* : ‖θ̂_τ − θ̂_{τ_2}‖_∞ ≤ b_1 u_{τ_2}^{κ_{τ_2}} γ_n for every τ_2 satisfying τ < τ_2 ≤ τ^*}, (4.1)

where ‖·‖_∞ is the sup-norm, b_1 is a constant satisfying b_1 > 4ρ, and γ_n = log n. The selected smoothness parameter for f is then based on κ_τ̂. The smoothness parameter selector described above essentially searches for the largest possible u_τ such that its corresponding estimator of f does not differ too much, in sup-norm, from those of all smaller u_τ's. The resulting κ_τ̂, after a minor adjustment, is used to approximate the smoothness parameter of the mean reward function. To understand how well the method above performs when the knowledge of κ is absent, consider a sub-class Σ_0(κ, ρ) of Σ(κ, ρ) consisting of functions whose histogram approximation error at each resolution u_τ is also bounded below by a multiple of u_τ^κ, a "self-similarity" requirement.
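A simplified version of the selector (d = 1, histogram estimators on dyadic bins, and the comparison rule of (4.1)) can be sketched as follows; the relation u_τ = n^{−1/(2κ_τ+1)} and the factor γ_n = log n follow the text, while the handling of empty bins is an assumption of this sketch.

```python
import math, random

def hist_est(data, u):
    """Histogram estimate on [0,1]: average the Y's whose X falls in each
    bin of width u; returns per-bin means (None for empty bins)."""
    m = int(round(1 / u))
    sums, cnts = [0.0] * m, [0] * m
    for x, y in data:
        b = min(int(x / u), m - 1)
        sums[b] += y
        cnts[b] += 1
    return [s / c if c else None for s, c in zip(sums, cnts)]

def lepski_tau(data, tau_lo, tau_hi, b1, n):
    """Lepski-type selector sketch of (4.1) for d = 1: return the smallest tau
    (largest width u_tau = 2^-tau) whose histogram stays within
    b1 * u_{tau2}^{kappa_{tau2}} * log n, in sup-norm, of every finer
    histogram with tau < tau2 <= tau_hi."""
    gamma = math.log(n)
    ests = {t: hist_est(data, 2.0 ** -t) for t in range(tau_lo, tau_hi + 1)}
    def kappa(t):                 # kappa_tau solves 2^-tau = n^{-1/(2k+1)}
        return (math.log(n) / (t * math.log(2)) - 1) / 2
    def supdiff(t1, t2):          # sup-norm distance between two histograms
        r = len(ests[t2]) // len(ests[t1])
        diffs = [abs(ests[t1][b // r] - ests[t2][b])
                 for b in range(len(ests[t2]))
                 if ests[t2][b] is not None and ests[t1][b // r] is not None]
        return max(diffs) if diffs else 0.0
    for t in range(tau_lo, tau_hi + 1):
        if all(supdiff(t, t2) <= b1 * (2.0 ** -t2) ** kappa(t2) * gamma
               for t2 in range(t + 1, tau_hi + 1)):
            return t
    return tau_hi
```

On data from a constant (hence very smooth) reward function with small bounded noise, all resolutions agree and the selector stops at the coarsest candidate width, as expected.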
Proposition 4.1. Suppose f ∈ Σ_0(κ, ρ) and Assumptions 2.1 and 2.2 hold. Then there exist an integer n_H and a constant C_H > 0 such that P(κ − Δ/log n − b_2 log log n/log n < κ̂ ≤ κ) ≥ 1 − C_H n^{−c_*} for every n > n_H, where c_* = κ^*/(2κ^*+d). Proposition 4.1 indicates that with high probability, the estimated smoothness parameter is no more than O(log log n/log n) smaller than κ, the largest possible smoothness parameter of the arm in Σ_0(κ, ρ).

Finite-time regret analysis
The regret analysis of the RAAE algorithm relies on the appropriate choice of the corresponding parameters. Set the parameters as follows. Let n_0 = N^{c_*} with c_* = κ^*/(2κ^*+d), and let h_1 = 1. Let the stage number T be given by (5.1). Take the threshold α_t in Step 2.3 to be α_t = 4ρ h_t^{κ̂_*}. (Alternatively, we may use a data-driven threshold based on the number of times the arm with the maximum sample average is pulled in bin B during stage t, with c some constant; for brevity, we only show the proof under the former choice of α_t.) Set the stage end points Ñ_t accordingly, where γ̃_t is a stage-dependent parameter chosen to make N_t a positive integer. In particular, it suffices to assume that N is large enough. Theorem 5.1 then shows that under Assumptions 2.1-2.3 with f_i ∈ Σ_0(κ, ρ) (1 ≤ i ≤ l), the cumulative regret of the RAAE algorithm satisfies R_N ≤ C̃ (log N)^{c̃} N^{1−κ(1+α)/(2κ+d)}, where C̃ is a positive constant (not depending on N or l) and c̃ > 0 is a constant.
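As an illustration under stated assumptions (the exact constants of section 5 may differ, so treat every formula below as hypothetical), one can compute the forced-sample size n_0 = N^{c_*}, dyadic bin widths starting at h_1 = 1, and a stage count T matched to the oracle bandwidth N^{−1/(2κ̂_*+d)}:

```python
import math

def raae_params(N, kappa_hat, kappa_up, d):
    """Sketch of the parameter schedule (assumed, not the paper's exact
    constants): forced-sample size n0 = N^{c*} with c* = kappa_up/(2*kappa_up+d),
    dyadic bin widths h_t = 2^{-(t-1)}, and T chosen so the final width is
    comparable to the oracle bandwidth N^{-1/(2*kappa_hat+d)}."""
    c_star = kappa_up / (2 * kappa_up + d)
    n0 = int(math.ceil(N ** c_star))
    T = max(1, int(math.ceil(math.log(N) / ((2 * kappa_hat + d) * math.log(2)))))
    h = [2.0 ** -(t - 1) for t in range(1, T + 1)]
    return n0, T, h
```

For example, with N = 10000, κ̂_* = 0.5, κ^* = 1 and d = 1, this gives n_0 = 22, T = 7 stages, and bin widths halving from 1 down to 2^{−6}.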
The cumulative regret rate in Theorem 5.1 matches the minimax rate obtained by Perchet and Rigollet (2013) up to a logarithmic factor. The additional logarithmic term is the price we pay for not knowing κ. If the value of κ is available, we simply set κ̂_* = κ and the exact minimax rate can be achieved.
It is noted that the sample size n_0 used by the smoothness parameter selector in Step 0 of the RAAE algorithm has to be chosen carefully in view of the subsequent steps. On one hand, n_0 should be large enough that the estimate of κ is accurate with high probability before its subsequent use. On the other hand, n_0 should be small enough that the regret from Step 0 can be controlled within the desired range. It is also worth mentioning that although the proposed algorithm appears to assume a known value of ρ, it suffices to know upper and lower bounds on ρ to obtain the same rate.
Remark 5.1. As pointed out in section 1, the ABSE algorithm (Perchet and Rigollet, 2013) can also be used to adaptively achieve a near minimax rate when equipped with the Lepski-type smoothness parameter selector. Indeed, in the proof of Theorem 5.1, we can see that Step 0 essentially serves as a plugged-in estimator of the smoothness parameter κ, and, because of Proposition 4.1, the analysis goes through almost as if we knew the true κ by using its estimator κ̂_* (in place of κ).

Discussion
In the nonparametric MABC problem, as far as we know, no algorithm before this work has been shown to be adaptively minimax-rate optimal with respect to the unknown smoothness parameter κ. Lepski's method is known to have successful applications in the context of adaptive nonparametric estimation. In the following, we discuss the connection of our proposed MABC algorithm with adaptive nonparametric estimation when Lepski's method is applied.
In the context of the RAAE algorithm, heuristically speaking, under-estimation of κ results in an overly small bin width, so that the smoothness of the reward functions is not fully utilized. Over-estimation of κ leads to possible premature elimination of good-performing arms, the probability of which cannot be properly bounded. Interestingly, in nonparametric estimation, Lepski's approach also has to consider separately the events that its built-in selector generates too small or too large smoothness parameter estimates. The former event (i.e., under-estimation of κ) is usually considered the technically "complicated" case of the two in nonparametric estimation. Its counterpart in the MABC problem (see Lemma A.3) turns out to be straightforward because the event probability can be bounded tightly by using the moment condition (Assumption 2.2) and a Bernstein-type inequality. The observation that the former event has a tight probability bound is shared in, e.g., Lepski (1990) and Lepski, Mammen and Spokoiny (1997) under a Gaussian white noise model. On the other hand, the latter event (i.e., over-estimation of κ) is usually considered the technically "easy" case of the two in nonparametric estimation because of the straightforward use of the built-in selector's definition. But such "easy" results do not apply to the MABC problem, since the over-estimation of κ has adverse effects on subsequent procedures.
Indeed, the difficulty caused by the over-estimation of κ is shared by the adaptive confidence band problem. If we only consider the Hölder condition without further assumptions, it is known that adaptive confidence bands generally do not exist (Low, 1997). As one solution to overcome this difficulty, Giné and Nickl (2010) propose a "self-similarity" condition, and show that the functions that do not satisfy this condition form a negligible subset of the Hölder class (see Condition 3 and Proposition 4 in Giné and Nickl, 2010). It turns out that the function class Σ_0(κ, ρ) defined in section 4 takes the form of their "self-similarity" condition. To see this connection, we consider in the rest of the discussion the special case where the covariate is univariate and has the distribution P_X ∼ Uniform[0, 1].

Appendix A: Lemmas and proofs
The proofs of Proposition 4.1 and Theorem 5.1 are given in sections A.1 and A.2, respectively. To keep this paper self-contained, we list the following two lemmas for convenience; their proofs can be found in Qian and Yang (2016).
Lemma A.1. Suppose {F_j, j = 1, 2, · · · } is an increasing filtration of σ-fields. For each j ≥ 1, let ε_j be an F_{j+1}-measurable random variable that satisfies E(ε_j | F_j) = 0, and let T_j be an F_j-measurable random variable that is upper bounded by a constant C > 0 in absolute value almost surely. If there exist positive constants v and c such that for all k ≥ 2 and j ≥ 1, E(|ε_j|^k | F_j) ≤ k! v^2 c^{k−2}/2, then for every x > 0 and every integer n ≥ 1, P(|Σ_{j=1}^n T_j ε_j| ≥ x) ≤ 2 exp(−x²/(2(n C² v² + C c x))).

Lemma A.2. Suppose {F_j, j = 1, 2, · · · } is an increasing filtration of σ-fields.
For each j ≥ 1, let W j be an F j -measurable Bernoulli random variable whose conditional success probability satisfies

A.1. Proof of Proposition 4.1
Proposition 4.1 is a straightforward result of the following two lemmas.

Given τ ∈ N, let M_τ be the set of bins with bin width u_τ that partition the domain. Clearly, |M_τ| = 1/u_τ^d. Then, given any τ_2 and τ such that τ̂ − 1 ≤ τ ≤ τ_2 ≤ τ^*, we can bound the sup-norm difference of the corresponding histogram estimators. To derive the upper bound for the inequality above, consider the behavior on the event M_τ(x) > 0. Let x*_B be a fixed point in bin B ∈ M_τ; then the previous display implies the stated bound, where the second inequality follows by (A.3) and the last inequality follows from the fact that ρ u_τ^κ < b_1 u_{τ_2}^{κ_{τ_2}} γ_n/4 for large enough n. The remaining deviation term is controlled by Lemma A.1.
As a result, the stated probability bound follows, where the last two inequalities use the observation that n u_τ^d u_τ^{2κ_τ} = 1 by the definition of κ_τ. Together with (A.1) and κ_τ̂ > κ − Δ/log n, we know that there exist an integer n_* and a constant C_H such that the desired bound holds for any n > n_*. This completes the proof of Lemma A.3.
Lemma A.4. Suppose f(·) ∈ Σ_0(κ, ρ) and Assumptions 2.1 and 2.2 hold. Then for κ̂ obtained by the procedures in section 4, there exist an integer n_* and a constant b_2 > 0 such that the over-estimation probability P(κ̂ > κ) is properly controlled for all n > n_*.

Proof of Lemma A.4. Let τ̂, κ_τ̂ and κ̂ be defined as in the proof of Lemma A.3. Let κ′ = κ + b_2 log log n/log n. Define the integer τ′ = max{τ : 2^τ ≤ n^{1/(2κ′+d)}}. Then, by the definition in (4.1) and the fact that τ′ < τ̂, the selection event can be related to sup-norm differences of the histogram estimators. Recall from the proof of Lemma A.3 that there is a constant C_{H1} bounding the corresponding deviation probability. It remains to find the upper bound for P(‖θ̂_{τ′} − f‖_∞ ≤ (3/2) b_1 u_{τ′}^{κ_{τ′}} γ_n). Note that, by triangle inequalities, this event forces the histogram approximation error of f at resolution u_{τ′} to be small. The previous inequality implies a contradiction for large enough n, where the second-to-last inequality follows since f ∈ Σ_0(κ, ρ), and the last inequality follows from the choice of b_2. Also, by derivations similar to those of (A.5) and (A.6), the corresponding deviation probabilities are small for all large enough n. Similarly, we can apply Azuma's inequality to obtain an analogous bound for all large enough n. Then, by (A.9), (A.10) and (A.11), together with (A.7) and (A.8), the desired bound follows, which completes the proof of Lemma A.4.
Proof of Proposition 4.1. By Lemma A.3 and Assumption 4.1, the under-estimation probability is controlled. Together with Lemma A.4 and the fact that there exists f_i ∈ Σ_0(κ, ρ), the proof of Proposition 4.1 is complete.

A.2. Proof of Theorem 5.1
Proof of Theorem 5.1. Let V_0 = {κ − Δ/log n − b_2 log log n/log n < κ̂_* ≤ κ}. Inspired by the technique employed in the proof of Theorem 5.1 in Perchet and Rigollet (2013), we define some sets and events as follows. For every bin B ∈ B_T (at stage T), recall that p_t(B) is the parent bin of B at stage t, and S_{p_t(B)} is the set of arms in p_t(B) that survive the stage-t arm elimination. Then, for every bin B ∈ B_T and every t (1 ≤ t ≤ T), define the sets of arms S_{t,B,1} and S_{t,B,2}, and define the events G_{t,B,1} and G_{t,B,2}. Here, we consider G_{t,B,1} and G_{t,B,2} as "good" events because G_{t,B,1} means that all possible best arms in bin p_t(B) survive the stage-t arm elimination, and G_{t,B,2} means that all surviving arms in S_{p_t(B)} have regret no larger than 8ρ h_t^{κ̂_*}. Further define the sets A_{t,B} and F_{t,B}: the set A_{t,B} means that the "good" events happen at stage t, and F_{t,B} means that such "good" events happen during all of the first t stages. Note that (A.14) holds. Then, by the tree diagram, the cumulative regret can be decomposed into the stagewise terms R_1, R_2, · · · , R_{T+1}. Next, we provide upper bounds for R_1, R_2, · · · , R_{T+1}. Let E^{(0)}(·) and P^{(0)}(·) denote the conditional expectation and conditional probability given κ̂_* = κ_0 (κ − Δ/log n − b_2 log log n/log n < κ_0 ≤ κ), respectively. Then the bound for R_1 follows, where the last inequality is by Lemma A.5. Similarly, by definition, for 2 ≤ t ≤ T, the bound for R_t follows, where the second-to-last inequality is by the definition of the event F_{t−1,B}. Then, by the conditional independence of the relevant events, the first inequality follows by Assumption 2.3, and the second inequality follows by Assumption 2.3, Lemma A.5 and the choice of α_t. Similarly, the bound for R_{T+1} follows.
The proof of Theorem 5.1 above needs the following lemma.
The complement of G_{t,B,1} implies that there exists an arm i_1 ∈ S_{t,B,1} such that arm i_1 is eliminated at the end of stage t (within bin p_t(B)). For notational brevity, denote p_t(B) by B̃. Recall that if N_{B̃,i} ≠ 0, we have Ȳ_{B̃,i} = Σ_{n∈H_{B̃,i}} Y_{i,n}/N_{B̃,i}. Then, by the arm elimination mechanism, there exists an arm i_2 ∈ S_{B̃} such that Ȳ_{B̃,i_2} − Ȳ_{B̃,i_1} > α_t. Then, since N_{B̃,i_1} ≠ 0 and N_{B̃,i_2} ≠ 0, the gap can be decomposed into deviation terms. Given 1 ≤ i ≤ l, for notational brevity, define the conditioning information C^{(i)}_{t−1}. For the upper bound of the first term in (A.29), note that (A.30) holds, where the last inequality follows by Lemma A.2 and the fact that P(X_n ∈ B, I_n = i | C^{(i)}_{t−1}) ≥ c h_t^d π̃_t for all Ñ_{t−1} + 1 ≤ n ≤ Ñ_t. To provide the upper bound for the second term in (A.29), define H_{B̃} = {n : Ñ_{t−1} + 1 ≤ n ≤ Ñ_t, X_n ∈ B̃} to be the set of time points during stage t at which the covariates fall into bin B̃, and let N_{B̃} be the size of H_{B̃}. Then the bound follows, where P^X_t(·) denotes the conditional probability given (X_{Ñ_{t−1}+1}, X_{Ñ_{t−1}+2}, · · · , X_{Ñ_t}), C^{(i)}_{t−1} and {κ̂_* = κ_0}, and E_c(·) denotes the conditional expectation given C^{(i)}_{t−1} and {κ̂_* = κ_0}. Since P(X_n ∈ B̃) ≥ c h_t^d, by Lemma A.2, the desired bound follows.


Appendix B: Simulation

Case 2. Suppose a three-armed bandit with d = 1 generates 0-1 binary responses using (conditional) mean reward functions with κ = 0.5, where f_2(x) = −f_1(x) and f_3(x) = 0.5, and m = 4, 10, 20, or 40. All the other settings of Case 2 remain the same as those of Case 1.
In this focused illustration with RAAE, κ is known to the user, and we set n_0 = 20, γ̃_t = 1, and ρ = 0.5 or 1. Nadaraya-Watson regression with a Gaussian kernel is applied as the user-specified regression modeling method for each active arm in Step 1.2, and at each time point n, the bandwidth is N_{i,n}^{−1/(2κ+d)}, where N_{i,n} is the total number of times arm i has been pulled before time point n. To compare the performance of using π_n = 1/l_B versus π_n = 0.05, we run the algorithm 100 times for each choice of π_n. The averaged per-round regret r̄_N and the averaged inferior sampling rate q̄_N are computed over the 100 runs. All the numerical work was implemented in C++, and the code is available upon request.
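The Nadaraya-Watson step used in the simulation can be sketched as follows (in Python rather than the C++ used for the actual experiments; the estimator and the bandwidth rule N_{i,n}^{−1/(2κ+d)} follow the description above, while the zero-denominator fallback is an assumption of this sketch).

```python
import math

def nw_estimate(data, x, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel, as in Step 1.2:
    f_hat(x) = sum_j K((x - X_j)/h) * Y_j / sum_j K((x - X_j)/h)."""
    num = den = 0.0
    for xj, yj in data:
        w = math.exp(-0.5 * ((x - xj) / bandwidth) ** 2)
        num += w * yj
        den += w
    return num / den if den > 0 else 0.0   # fallback when all weights vanish

def bandwidth(n_pulls, kappa=0.5, d=1):
    """Bandwidth rule from the appendix: h = N_{i,n}^{-1/(2*kappa + d)};
    with kappa = 0.5 and d = 1, an arm pulled N times gets h = N^{-1/2}."""
    return n_pulls ** (-1.0 / (2 * kappa + d))
```

For instance, an arm pulled 100 times gets bandwidth 100^{−1/2} = 0.1, and with a very small bandwidth the estimate at a design point reduces to that point's own response.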
Based on the results summarized in Table 1, we can see that in both cases, the choice of π n = 0.05 (which uses the information obtained in Step 1.2 with Nadaraya-Watson regression) outperforms the choice of π n = 1/l B (which ignores Step 1.2 and pulls each active arm with equal probability). Here, the RAAE algorithm shows its practical potential to improve algorithm performance by effectively employing user-specified regression modeling methods such as the Nadaraya-Watson regression to differentiate the active arms.