Gaussian Process bandits with adaptive discretization

In this paper, the problem of maximizing a black-box function $f:\mathcal{X} \to \mathbb{R}$ is studied in the Bayesian framework with a Gaussian Process (GP) prior. In particular, a new algorithm for this problem is proposed, and high probability bounds on its simple and cumulative regret are established. The query point selection rule in most existing methods involves an exhaustive search over an increasingly fine sequence of uniform discretizations of $\mathcal{X}$. The proposed algorithm, in contrast, adaptively refines $\mathcal{X}$, which leads to a lower computational complexity, particularly when $\mathcal{X}$ is a subset of a high dimensional Euclidean space. In addition to the computational gains, sufficient conditions are identified under which the regret bounds of the new algorithm improve upon the known results. Finally, an extension of the algorithm to the case of contextual bandits is proposed, and high probability bounds on the contextual regret are presented.


Introduction
We consider the problem of maximizing a function $f : \mathcal{X} \to \mathbb{R}$ from its noisy observations of the form $y_t = f(x_t) + \eta_t$, $t = 1, 2, \ldots, n$, where $\eta_t$ is the observation noise at time $t$. We work in the Bayesian setting, assuming that the function $f$ is a sample from a zero mean Gaussian Process (GP) indexed by the space $\mathcal{X}$, and $\eta_t$ for $t \geq 1$ are i.i.d. $N(0, \sigma^2)$ Gaussian random variables. We further assume that the function $f$ is expensive to evaluate, and we are allocated a budget of $n$ function evaluations. This problem can be thought of as an extension of the Multi-armed bandit (MAB) problem to the case of infinitely (possibly uncountably) many arms indexed by the set $\mathcal{X}$, and is referred to as the GP bandits problem. The goal is to design a strategy of sequentially selecting query points $x_t \in \mathcal{X}$ based on the past observations $\{(x_i, y_i); 1 \leq i \leq t-1\}$ and the prior on $f$. As in the case of MAB with finitely many arms, the performance of any query point selection strategy is usually measured by the cumulative regret $R_n := \sum_{t=1}^{n} \big(f(x^*) - f(x_t)\big)$ (2), where $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$; this measure forces the agent to address the exploration-exploitation trade-off. An alternative measure of performance is the simple regret $S_n := f(x^*) - f(x(n))$, where $x(n)$ is the point recommended by the agent after $n$ evaluations; it is used in the Bayesian Optimization (BO) or pure exploration problem.
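To make the setup concrete, the following sketch samples $f$ from a zero mean GP on a finite grid standing in for $\mathcal{X} = [0, 1]$ and computes both regret measures for a placeholder query rule; the squared exponential kernel, its lengthscale, and the uniform query rule are illustrative assumptions, not the algorithm studied in this paper.

```python
import numpy as np

def se_kernel(X1, X2, length=0.2):
    """Squared-exponential kernel K(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-d**2 / (2 * length**2))

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)             # finite stand-in for X = [0, 1]
K = se_kernel(grid, grid)
f = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-10 * np.eye(len(grid)))

sigma = 0.1                               # noise std for eta_t
n = 50
idx = rng.integers(0, len(grid), size=n)  # placeholder query rule (uniform)
y = f[idx] + sigma * rng.standard_normal(n)

# Cumulative regret R_n = sum_t (f(x*) - f(x_t)); simple regret S_n = f(x*) - f(x(n)).
f_star = f.max()
R_n = np.sum(f_star - f[idx])
S_n = f_star - f[idx[np.argmax(y)]]       # recommend the point with best observed value
assert R_n >= 0 and S_n >= 0
```

Any sensible query strategy should make $R_n$ grow sublinearly in $n$, which uniform queries do not.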

Prior work
Optimizing a black-box function from its noisy observations is an active area of research with a large body of literature. Here, we review existing methods which take a Bayesian approach with a GP prior to this problem and have provable guarantees on their performance. Srinivas et al. (2012) formulated the task of black-box function optimization as a MAB problem and proposed the GP-UCB algorithm, which is a modification of the Upper Confidence Bound (UCB) strategy widely used in the bandit literature. The algorithm constructs high probability UCBs on the function values using the GP posterior and selects the evaluation points by maximizing the UCB over $\mathcal{X}$. For finite search spaces $\mathcal{X}$, they showed that the GP-UCB algorithm admits a high probability upper bound on the cumulative regret of the form $R_n = O\big(\sqrt{n \gamma_n \log |\mathcal{X}|}\big)$, where $\gamma_n$ is the maximum information gain with $n$ evaluations. We will refer to cumulative regret bounds of this form as information-type regret bounds in this paper. In addition, to make the dependence on $n$ explicit, Srinivas et al. (2012) further derived bounds on the term $\gamma_n$ for some commonly used kernels. Finally, they presented an extension of the GP-UCB algorithm to the case of continuous $\mathcal{X}$ by applying it on a sequence of increasingly fine uniform discretizations of $\mathcal{X}$.
Follow up works to Srinivas et al. (2012) have extended the GP-UCB algorithm in several ways. Contal and Vayatis (2016) proposed a method of constructing a sequence of uniform discretizations with tight control over the approximation error, which allowed the extension of the GP-UCB algorithm to arbitrary compact metric spaces X . Desautels et al. (2014) and Contal et al. (2013) considered the GP bandits problem with the additional assumption that the evaluations can be performed in parallel. Desautels et al. (2014) proposed the GP-BUCB algorithm which selects the points in a batch sequentially by maximizing a variant of the UCB, which is computed by keeping the mean function fixed and only updating the posterior variance. Contal et al. (2013) proposed the GP-UCB-PE which uses the UCB function for selecting the first point of a batch, and then proceeds in a greedy manner selecting the remaining points by maximizing the posterior variance. Krause and Ong (2011) proposed and analyzed the CGP-UCB algorithm for the contextual GP bandits problem, where the mean reward function corresponding to context-action pairs is modeled as a sample from a GP on the context-action product space. Kandasamy et al. (2016) considered a multi-fidelity version of the GP bandits problem in which they assumed the availability of a sequence of approximations of the true function f with increasing accuracies which were cheaper to evaluate. They proposed an extension of GP-UCB called the MF-GP-UCB and derived information-type bounds on its cumulative regret. Wang et al. (2016) proposed the GP-EST algorithm which looks at the optimization problem through the lens of estimation. In particular, the algorithm constructs an estimate of the maximum function value f (x * ), and then selects a point for evaluation which has the largest probability of attaining this value. 
Russo and Van Roy (2014) analyzed the performance of the Thompson Sampling algorithm for a large class of problems, including the GP bandits problem. Thompson Sampling is a randomized strategy in which query points are sampled according to the posterior distribution on $x^*$. Since computing the posterior on $x^*$ may be complicated, in practice the query points are selected by the following two step procedure: first, a sample $\tilde{f}_t$ of the unknown function $f$ is generated, and then the query point $x_t$ is chosen by maximizing $\tilde{f}_t$ over $\mathcal{X}$. For the case of continuous $\mathcal{X}$, the function samples are generated over uniform discretizations $\mathcal{X}_t$ of $\mathcal{X}$. By observing a relation between the expected regret of Thompson Sampling and UCB strategies, Russo and Van Roy (2014) obtained information-type bounds on the expected cumulative regret of the Thompson Sampling algorithm for GP bandits.
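The two step procedure described above can be sketched as follows; this is a toy implementation on a fixed uniform discretization, with the kernel, noise level, and horizon chosen purely for illustration.

```python
import numpy as np

def gp_posterior(K, obs_idx, y, sigma2):
    """Posterior mean/covariance of a zero mean GP on a grid, given noisy
    observations y at the grid indices obs_idx."""
    Kxx = K[np.ix_(obs_idx, obs_idx)] + sigma2 * np.eye(len(obs_idx))
    Ks = K[:, obs_idx]                        # cross-covariances, shape (N, m)
    A = np.linalg.solve(Kxx, Ks.T)            # shape (m, N)
    return A.T @ y, K - Ks @ A

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 100)
diff = grid[:, None] - grid[None, :]
K = np.exp(-diff**2 / (2 * 0.2**2))
f = rng.multivariate_normal(np.zeros(100), K + 1e-10 * np.eye(100))

sigma2, obs_idx, y = 0.01, [], []
for t in range(20):
    if obs_idx:
        mu, cov = gp_posterior(K, np.array(obs_idx), np.array(y), sigma2)
    else:
        mu, cov = np.zeros(100), K
    # Step 1: draw a sample f_t of the unknown function from the posterior.
    f_t = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(100))
    # Step 2: choose the query point by maximizing the sample over the grid.
    x_t = int(np.argmax(f_t))
    obs_idx.append(x_t)
    y.append(f[x_t] + np.sqrt(sigma2) * rng.standard_normal())
```

Note that each round requires a posterior computation and a maximization over the whole discretization, which is exactly the cost the present paper's adaptive refinement avoids.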
As observed in (Bubeck et al., 2011a), bounding the cumulative regret automatically gives us a bound on the expected simple regret by employing a randomized point recommendation strategy. Additionally, for the pure exploration setting, several algorithms specifically geared towards minimizing S n , such as Expected Improvement (GP-EI), Probability of Improvement(GP-PI), Entropy Search and Bayesian Multi-Scale Optimistic Optimization (BaMSOO) have been proposed (see (Shahriari et al., 2016) for a recent survey). Bogunovic et al. (2016b) considered the BO and Level Set Estimation problems in a unified manner and proposed the Truncated Variance Reduction (TRUVAR) algorithm which selects evaluation points greedily to obtain the largest reduction in the sum of truncated variances of the potential maximizers. The performance of all these algorithms have been empirically studied over various synthetic as well as real-world datasets. Furthermore, theoretical guarantees are also known for GP-EI (Bull, 2011) and BaMSOO (Wang et al., 2014) with noiseless observations, and for TRUVAR (Bogunovic et al., 2016b) with noisy observations and non-uniform cost of evaluations.
All the algorithms above, with the exception of BaMSOO, require solving an auxiliary optimization problem in each round $t$ for selecting the query point $x_t$. The objective function of this auxiliary optimization problem is usually non-convex and multi-modal, and hence requires an exhaustive search over an increasingly fine sequence of uniform discretizations to guarantee that a close approximation of the true optimum is found (Contal and Vayatis, 2016). The size of these uniform discretizations increases exponentially with the dimension of $\mathcal{X}$. This is because these discretizations are chosen off-line and do not depend on the function evaluations made up to round $t$. In contrast, BaMSOO adaptively constructs discretizations by locally refining the regions of $\mathcal{X}$ in which $f$ is more likely to take higher values based on the observations. As a result, the size of the discretizations under BaMSOO is independent of the dimension of $\mathcal{X}$, which leads to significantly lower computational costs when $\mathcal{X}$ is high dimensional. Our work is strongly motivated by this aspect of BaMSOO, and we provide the first algorithm for GP bandits with noisy observations whose computational complexity remains independent of the dimension of $\mathcal{X}$.

Our contributions
In this paper, we address two issues with existing approaches to the GP bandits problem: 1. As discussed above, all the existing algorithms for GP bandits require solving an auxiliary optimization problem over the entire search space for selecting a query point, which may be computationally infeasible; thus, practical implementations resort to various approximation techniques which do not come with theoretical guarantees.
2. Furthermore, by constructing specific Gaussian Processes we show that the information-type regret bounds can be too pessimistic, thus motivating the need for designing algorithms that admit alternative analysis techniques.
To tackle these two problems, we design algorithms for GP bandits which utilize ideas from existing works in the Lipschitz function optimization literature, such as (Bubeck et al., 2011b;Munos, 2011;Munos et al., 2014;Kleinberg et al., 2013). More specifically, our main contributions are as follows: • We first present an algorithm for GP bandits which employs a tree of partitions of the search space X to adaptively refine it based on observations. We show that because of the adaptive discretization, when X ⊂ R D and D is large, our algorithm has significantly less computational complexity than algorithms requiring auxiliary optimization.
• We obtain high probability bounds on the cumulative regret of our algorithm which are always as good as, and in some cases strictly better than, the existing regret bounds. In particular, we obtain the first explicit sublinear regret bounds for the GP with exponential kernel (Ornstein-Uhlenbeck process) and also identify sufficient conditions under which our bounds improve upon the current ones for Matérn family of kernels.
• We also derive high probability bounds on the simple regret for our algorithm. To the best of our knowledge, BaMSOO (Wang et al., 2014) is the only adaptive algorithm for the black-box optimization problem in the Bayesian setting for which theoretical guarantees on simple regret are known. Our algorithm matches BaMSOO's performance with the additional advantages that it requires fewer assumptions on the covariance functions and can work with noisy observations.
• We also study two extensions of our algorithm. First, we present a Bayesian Zooming algorithm based on (Kleinberg et al., 2013;Slivkins, 2014) and obtain theoretical guarantees on its regret performance. This algorithm assumes a covering oracle access to the metric space X instead of requiring a hierarchical tree of partitions of X . We then extend our algorithm for GP bandits to the contextual GP bandits and obtain bounds on the contextual regret.
• Finally, our algorithms and the theoretical bounds rely on a set of technical results about Gaussian Processes which may be of independent interest. We provide these results and discuss their implications in Section 6.

Toy examples
As mentioned earlier, our cumulative regret bounds for Matérn kernels improve upon the known information-type bounds for GP bandits. In this section, we attempt to provide some intuition for this result. In particular, we construct two toy examples which serve to highlight a potential drawback of the information-type regret bounds for GP bandit problems shown in (4). The information-type regret bounds (4) depend on the maximum information gain $\gamma_n$, which is defined as $\gamma_n := \max_{x[1:n] \subset \mathcal{X}} I(f; y_{x[1:n]})$. Here $I(f; y_{x[1:n]})$ is the mutual information between the unknown function $f$ and the vector of observations $y_{x[1:n]}$ corresponding to the $n$ query points $x[1:n]$. This term depends on the covariance function² of the Gaussian Process (GP), and upper bounds on $\gamma_n$ for many commonly used GPs are given in (Srinivas et al., 2012). We note that since our aim is to gather information about a maximizer $x^*$ of $f$, and not necessarily about the behavior of $f$ over the entire space $\mathcal{X}$, information-type regret bounds can be quite loose. We present two examples which have been specifically constructed to illustrate scenarios where the regret bounds implied by (4) are very pessimistic. Both examples utilize the fact that the maximum information gain $\gamma_n$ can be large if the Gaussian Process has many independent components, even when the maximizer may be simple to learn. For our first example, we construct a GP whose samples have a simple structure around the maximum despite a highly complex structure away from the maximizer. More specifically, we begin by dividing the interval $[0, 1]$ into three equal subintervals. Over the second and third subintervals, the GP sample varies smoothly as scaled and shifted versions of a smooth function $\varphi(\cdot)$, modulated by a standard Normal random variable $X_1$. The first subinterval is further divided into three parts, and this process continues infinitely.
For this GP, we can claim the following (details in Appendix A.1): • For the choice of $a_i$ described in Appendix A.1, we have $\gamma_n = \Omega\big(\frac{n\sigma^2}{\log(n)}\big)$, which means that the information-type bound (4) on the cumulative regret is linear in $n$.
²We will use the terms covariance function and kernel interchangeably.
• On the other hand, if $a_1 \gg a_i$ for $i \geq 2$, then the true maximizer $x^* \in \{1/2, 5/6\}$ with high probability, and it can be identified with just one function evaluation, implying a constant cumulative regret, $R_n \leq O(1)$.
For our second example, we construct a GP in which the search space is partitioned at different scales, and statistically equivalent components are assigned to the sets of a given partition. This process is repeated with increasingly finer partitions, and we show that for a certain choice of parameters, each observation of the GP sample shrinks the region of uncertainty associated with $x^*$ by a constant factor. However, the information-type bound is again dominated by the information obtained from the large number of independent components of the GP and gives a linear upper bound on the cumulative regret.
Example 2. We again take $\mathcal{X} = [0, 1]$ and let $\varphi_1$ denote the following function, where $\varphi$ is the function used in Example 1. Let us now define a GP $\{f(x) \mid x \in \mathcal{X}\}$ recursively as follows. As before, $(a_i)_{i \geq 1}$ is a decreasing sequence of positive real numbers, and $(X_i)_{i \geq 1}$ are i.i.d. standard Normal random variables. For this example, we can claim the following: • If the noise variance $\sigma^2$ is small enough, we have $\gamma_n = \Omega(n)$, which implies a linear-in-$n$ information-type bound on the cumulative regret.
• With the choice of parameters (a i ) i≥1 described in Appendix A.2, we can select the evaluation points in such a way that with high probability after every observation, the size of the region containing x * shrinks by a factor of 3, which in turn implies that the cumulative regret satisfies R n ≤ O(log n).
Both our examples have been specifically crafted to highlight scenarios in which the information-type upper bounds given in (4) may not reflect the actual performance of the algorithms due to their dependence on the term $\gamma_n$. In Section 4.2 we further strengthen this observation by showing that the information-type regret bounds are loose for a practically relevant class of Gaussian Processes.
The rest of the paper is organized as follows: In Section 2 we introduce the required definitions and present some background for the problem. We then describe our algorithm for GP bandits and analyze its regret in Section 3. We discuss the behavior of our algorithm in some specific problem instances in Section 4. In Section 5 we study two extensions of our approach and analyze their performance. Finally, Section 6 contains some technical results which were used in designing our algorithms.

Preliminaries
In this section we recall some definitions required for stating the results, and fix the notation used.
Definition 1. A Gaussian Process is a collection of random variables $\{f(x); x \in \mathcal{X}\}$ which satisfy the property that $(f(x_1), f(x_2), \ldots, f(x_m))$ is a jointly Gaussian random vector for all $\{x_1, x_2, \ldots, x_m\} \subset \mathcal{X}$ and $m \in \mathbb{N}$. A Gaussian Process is completely specified by its mean function $\mu(x) = \mathbb{E}[f(x)]$ and its covariance function $K(x_1, x_2) = \mathbb{E}[(f(x_1) - \mu(x_1))(f(x_2) - \mu(x_2))]$. For a comprehensive discussion about Gaussian Processes and their applications in machine learning, see (Rasmussen and Williams, 2006).
Remark 1. Any zero mean Gaussian Process with covariance function $K$ induces a metric $d$ on its index set $\mathcal{X}$, defined as $d(x_1, x_2) := \sqrt{\mathbb{E}[(f(x_1) - f(x_2))^2]} = \sqrt{K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)}$, which gives us the following useful tail bound for any $x_1, x_2 \in \mathcal{X}$ and $a \geq 0$: $\Pr\big(|f(x_1) - f(x_2)| \geq a\big) \leq 2\exp\big(-a^2/(2\,d(x_1, x_2)^2)\big)$.
Next, we introduce some properties of a metric space $(\mathcal{X}, l)$ which will be used later on.
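The induced metric and the tail bound of Remark 1 can be checked numerically; the sketch below uses a squared exponential kernel (an assumption for illustration) and verifies the bound by Monte-Carlo.

```python
import numpy as np

def se(x1, x2, length=0.3):
    """Squared-exponential kernel on the real line."""
    return np.exp(-(x1 - x2)**2 / (2 * length**2))

def gp_metric(x1, x2, k=se):
    """Canonical metric d(x1,x2) = sqrt(K(x1,x1) + K(x2,x2) - 2 K(x1,x2))."""
    return np.sqrt(max(k(x1, x1) + k(x2, x2) - 2 * k(x1, x2), 0.0))

x1, x2, a = 0.2, 0.5, 2.0
d12 = gp_metric(x1, x2)

# Monte-Carlo check of P(|f(x1) - f(x2)| >= a) <= 2 exp(-a^2 / (2 d^2)):
# f(x1) - f(x2) is Gaussian with variance d(x1,x2)^2.
rng = np.random.default_rng(2)
cov = np.array([[se(x1, x1), se(x1, x2)], [se(x2, x1), se(x2, x2)]])
samples = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
emp = np.mean(np.abs(samples[:, 0] - samples[:, 1]) >= a)
bound = 2 * np.exp(-a**2 / (2 * d12**2))
assert emp <= bound
```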
Definition 2. Suppose X is a non-empty set and l is a metric on X . Then we have the following: • A subset X 1 of X is called an r-covering set of X if for any x ∈ X , we have l(x, X 1 ) ≤ r where l(x, X 1 ) := inf{l(x, y) : y ∈ X 1 }. The cardinality of the smallest such X 1 is called the r-covering number of X with respect to l, denoted by N (X , r, l).
• The metric dimension of a space $\mathcal{X}$ with associated metric $l$ is the smallest number $D_1$ such that $N(\mathcal{X}, r, l) \leq C r^{-D_1}$ for all $r > 0$, for some constant $C > 0$.
For bounded subsets of R D with a metric l, the metric dimension coincides with the usual notion of dimension (van Handel, 2014, page 125). The metric dimension D 1 gives us a notion of dimensionality intrinsic to the metric space (X , l). We now present a function specific measure of dimensionality of (X , l).
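The covering number and the metric dimension can be estimated numerically on a sampled point cloud; the greedy construction below is a standard heuristic (not from the paper) and recovers a slope close to $D = 2$ for the unit square.

```python
import numpy as np

def covering_number(points, r):
    """Greedy upper bound on the r-covering number N(X, r, l) of a finite point
    cloud under the Euclidean metric: repeatedly pick a point, discard everything
    within distance r of it, and count the picks."""
    remaining = points.copy()
    count = 0
    while len(remaining):
        center = remaining[0]
        remaining = remaining[np.linalg.norm(remaining - center, axis=1) > r]
        count += 1
    return count

# N(X, r) ~ C r^{-D} for X = [0,1]^D, so the metric dimension is roughly the
# slope of log N against log(1/r).
rng = np.random.default_rng(3)
cloud = rng.random((20_000, 2))           # dense sample of the unit square
rs = np.array([0.2, 0.1, 0.05])
Ns = np.array([covering_number(cloud, r) for r in rs])
slope = np.polyfit(np.log(1 / rs), np.log(Ns), 1)[0]   # ~ metric dimension
```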
Definition 3. Suppose $\mathcal{X}$ is a non-empty set, $l$ is a metric on $\mathcal{X}$ and $f$ is a function from $\mathcal{X}$ to $\mathbb{R}$. Then:
• A subset $\mathcal{X}_2$ of $\mathcal{X}$ is called an $r$-separated set of $\mathcal{X}$ if for any distinct $x_1, x_2 \in \mathcal{X}_2$ we have $l(x_1, x_2) \geq r$. The cardinality of the largest such set $\mathcal{X}_2$ is called the $r$-packing number of $\mathcal{X}$ with respect to $l$, and is denoted by $M(\mathcal{X}, r, l)$.
• For parameters $\Delta_0 > 0$ and $\zeta \in (0, 1)$, the near-optimality dimension $D_f(\Delta_0, \zeta)$ of $(\mathcal{X}, l)$ with respect to $f$ is the smallest $D' \geq 0$ such that, for all $\Delta \leq \Delta_0$, the packing number $M(\mathcal{X}_\Delta, \zeta\Delta, l)$ of the set of $\Delta$-optimal points $\mathcal{X}_\Delta := \{x \in \mathcal{X} : f(x^*) - f(x) \leq \Delta\}$ is upper bounded by $C' \Delta^{-D'}$ for some constant $C' > 0$.
Our definition of the near-optimality dimension is based on similar definitions used in existing works in literature such as (Bubeck et al., 2011b;Munos, 2011;Valko et al., 2013).
Remark 2. We note that for any $(\mathcal{X}, l)$ with finite metric dimension $D_1$, by using volume arguments (van Handel, 2014, Lemma 5.13) we can show that $D_f(\Delta_0, \zeta) \leq D_1$. An example where this inequality is strict is given in (Bubeck et al., 2011b, Example 3).
Definition 4. We will call a compact metric space $(\mathcal{X}, l)$ well-behaved if there exists a sequence of subsets $(\mathcal{X}_h)_{h \geq 0}$ of $\mathcal{X}$, with $\mathcal{X}_h = \{x_{h,i}; 1 \leq i \leq N^h\}$ and associated cells $(X_{h,i})_{1 \leq i \leq N^h}$, satisfying the following properties:
P1 For every $h \geq 0$, the cells $\{X_{h,i}; 1 \leq i \leq N^h\}$ partition $\mathcal{X}$, with $x_{h,i} \in X_{h,i}$.
P2 Every cell is the union of the cells of its $N$ children, i.e., $X_{h,i} = \bigcup_{j=N(i-1)+1}^{Ni} X_{h+1,j}$. The nodes $x_{h+1,j}$ for $N(i-1)+1 \leq j \leq Ni$ are called the children of $x_{h,i}$, which in turn is referred to as their parent.
P3 We assume that the cells have geometrically decaying radii, i.e., there exist $0 < \rho < 1$ and $0 < v_2 \leq 1 \leq v_1$ such that for all $h \geq 0$ and $1 \leq i \leq N^h$ we have $B(x_{h,i}, v_2\rho^h, l) \subseteq X_{h,i} \subseteq B(x_{h,i}, v_1\rho^h, l)$, where $B(x, r, l)$ denotes the ball of radius $r$ centered at $x$ in the metric $l$. From P1 we can see that the cells $\{X_{h,i}; 1 \leq i \leq N^h\}$ partition the space $\mathcal{X}$ for every $h \geq 0$, while P2 implies that we get an increasingly fine sequence of partitions with increasing $h$. Finally, P3 imposes the condition that for any $h$, the points $x_{h,i}$ are evenly spread out in the space $\mathcal{X}$. The subsets $(\mathcal{X}_h)_{h \geq 0}$ satisfying these properties are said to form a tree of partitions (Munos et al., 2014; Bubeck et al., 2011b).
Remark 3. We note that if $\mathcal{X} = [a, b]^D \subset \mathbb{R}^D$ and $l$ is any metric on $\mathcal{X}$, then $\mathcal{X}$ is well-behaved according to the above definition. The cells $X_{h,i}$ in this case are $D$-dimensional hyper-rectangles, and the children cells $X_{h+1,j}$ for $N(i-1)+1 \leq j \leq Ni$ can be constructed from $X_{h,i}$ by dividing it along its longest edge into $N$ equal parts.
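The splitting rule of Remark 3 can be sketched as follows for $N = 3$; representing cells as (low, high) corner pairs is an implementation choice, not from the paper.

```python
import numpy as np

def split_cell(low, high, N=3):
    """Split a hyper-rectangle [low, high] into N equal children along its
    longest edge; for odd N the middle child keeps the parent's center point."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    axis = int(np.argmax(high - low))            # longest edge
    edges = np.linspace(low[axis], high[axis], N + 1)
    children = []
    for j in range(N):
        lo, hi = low.copy(), high.copy()
        lo[axis], hi[axis] = edges[j], edges[j + 1]
        children.append((lo, hi))
    return children

root = (np.zeros(2), np.array([1.0, 2.0]))       # X = [0,1] x [0,2]
kids = split_cell(*root, N=3)                    # splits along the longer axis
centers = [(lo + hi) / 2 for lo, hi in kids]
```

For odd $N$, the middle child's representative point coincides with the parent's, which is the nesting property used later in Section 4.1.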

Algorithm for GP bandits
We begin this section by describing the general outline of all the algorithms proposed in this paper in Section 3.1. Then we introduce our tree based algorithm for GP bandits and obtain high probability bounds on its regret in Section 3.2.

Symbol : Description (Reference)
$\mu_t, \sigma_t$ : posterior mean and variance functions
$R_n$ : cumulative regret (Section 1, (2))
$N, \rho, v_1, v_2$ : parameters of the tree of partitions (Section 2, Definition 4)
$(V_h)_{h \geq 0}$ : parameters of Algorithm 1 and Algorithm 3
$L_t$ : the set of leaf nodes (Section 3.2)
$p(x_{h,i})$ : parent node of $x_{h,i}$
$\beta_n$ : multiplicative factor for confidence intervals (Section 3.3.2, Claim 1)
$h_{max}$ : maximum depth of the tree (Section 3.3.2, (17))
index used for action selection in Algorithm 3 (Section 3.3.2, (38))
$\mathcal{X}_c, \mathcal{X}_a$ : context space and action space (Section 5.2)
parameters of Algorithm 2
$A_t$ : set of active points (Section 5.1)
upper bound on the variation of $f$ in $B(x, r_k, l)$ for any $x \in \mathcal{X}$ (Claim 6)

General approach
At any time t, we maintain a discretization (i.e., a finite subset) of X , denoted by X t . To each x ∈ X t , we have an associated confidence region denoted by Reg t (x), and an index Ind t (x) which is a high probability upper bound on the maximum value of the function f in Reg t (x). The index Ind t (x) depends on three quantities: (a) the actual function value at x, (b) the amount of uncertainty in the function value at x, and (c) the amount of variation in the function value in Reg t (x). We proceed as follows: • In each round, we select a candidate point x t optimistically by maximizing Ind t (x) over X t .
• If the uncertainty in the function value at x t is smaller than the variation of f in the confidence region, it means that we must refine our discretization in the confidence region associated with x t .
• If, on the other hand, the uncertainty in the function value at x t is larger than the variation of f in the associated confidence region, our algorithm evaluates the function at this point to reduce this uncertainty.
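The round structure described by the bullets above can be condensed into one generic step; all callables below are hypothetical stand-ins for $\mathrm{Ind}_t$, the uncertainty $\beta_n \sigma_{t-1}$, the in-region variation, and the two possible actions.

```python
def step(active, index, uncertainty, variation, refine, evaluate):
    """One round of the generic scheme: pick the most optimistic candidate,
    then refine the discretization around it or evaluate f there, depending
    on which source of error dominates."""
    x = max(active, key=index)                # optimistic selection over X_t
    if uncertainty(x) <= variation(x):
        refine(x)       # posterior at x is accurate enough: zoom in around x
    else:
        evaluate(x)     # function value at x is too uncertain: query it
    return x

# Toy run: constant uncertainty 0.1 < variation 0.3, so the chosen point is
# refined (two nearby children are added to the active set).
active = [0.25, 0.75]
picked = step(active,
              index=lambda x: -abs(x - 0.6),
              uncertainty=lambda x: 0.1,
              variation=lambda x: 0.3,
              refine=lambda x: active.extend([x - 0.1, x + 0.1]),
              evaluate=lambda x: None)
```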
In Section 3.2 we present an algorithm for GP bandits which uses a hierarchical partitioning scheme for locally refining the search space similar to (Munos et al., 2014;Bubeck et al., 2011b;Wang et al., 2014). Alternatively, the covering oracle based approach used by Slivkins (2014); Kleinberg et al. (2013) can also be employed for refining the discretization, and we describe such an algorithm in Section 5.1. We also apply this approach to design an adaptive algorithm for the Contextual GP bandits problem in Section 5.2.

Tree based Algorithm
We now describe our algorithm for GP bandits and derive high probability bounds on its regret. Our algorithm is motivated by several tree based methods that have been proposed for function optimization under Lipschitz-like assumptions, such as (Bubeck et al., 2011b;Munos, 2011;Munos et al., 2014). Assuming that the metric space (X , l) is well behaved, i.e., we have a sequence of subsets (X h ) h≥0 whose associated cells form a tree of partitions of X , we proceed as follows: • In every round t, the algorithm maintains an active set of leaf nodes denoted by L t , such that the cells of the nodes in L t partition X . This active set is initialized to L 0 = {x 0,1 } with the associated cell X 0,1 = X .
• The algorithm selects a node from $L_t$ by maximizing an index $I_t$. The index $I_t(x_{h,i})$ is an upper confidence bound (UCB) on the maximum function value in the cell $X_{h,i}$ and is defined as $I_t(x_{h,i}) := \bar{U}_t(x_{h,i}) + V_h$. The term $\bar{U}_t(x_{h,i})$ in the above equation is a high probability upper bound on the function value at $x_{h,i}$ and is defined as $\bar{U}_t(x_{h,i}) := \min\big\{\mu_{t-1}(x_{h,i}) + \beta_n \sigma_{t-1}(x_{h,i}),\ \mu_{t-1}(p(x_{h,i})) + \beta_n \sigma_{t-1}(p(x_{h,i})) + V_{h-1}\big\}$, where $p(x_{h,i})$ is the parent node of $x_{h,i}$. For any $h \geq 0$, the term $V_h$ is an upper bound on the maximum function variation in any cell $X_{h,i}$ at level $h$. Thus, we see that $\bar{U}_t(x_{h,i})$ computes an upper bound on the value of $f(x_{h,i})$ in two ways and takes their minimum, while adding $V_h$ to it gives us an upper bound on the maximum function value in the cell $X_{h,i}$.
• Having chosen the point $x_{h_t,i_t}$ according to the selection rule (Line 2 of Algorithm 1), we take one of the following two actions:
- Refine: If $\beta_n \sigma_{t-1}(x_{h_t,i_t}) \leq V_{h_t}$, then the node $x_{h_t,i_t}$ is expanded, i.e., the $N$ children nodes $\{x_{h_t+1,j} : N(i_t - 1) + 1 \leq j \leq N i_t\}$ of the node $x_{h_t,i_t}$ are added to the set of leaves, and $x_{h_t,i_t}$ is removed from it (Lines 4-5 of Algorithm 1).
- Evaluate: Otherwise, the function is evaluated at the point $x_{h_t,i_t}$, i.e., we observe the noisy function value $y_t = f(x_{h_t,i_t}) + \eta_t$ and update the posterior distribution of $f$ (Lines 7-9 of Algorithm 1).
The steps of the algorithm are shown as pseudo-code in Algorithm 1. The algorithm maintains two counters: $t$, which counts the total number of function evaluations and refinements, and $n_e$, which keeps track of the number of function evaluations. The algorithm stops after $n$ function evaluations, and recommends a point from one of the deepest expanded cells (for minimizing $S_n$). The second condition on Line 3 of Algorithm 1 is added to prevent the (unlikely) scenario in which the algorithm keeps refining indefinitely without evaluating the function.
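A minimal sketch of the index computation, under one plausible reading of the definition of $\bar{U}_t$ (the exact combination is a reconstruction; the flat prior and toy parameter values are assumptions for illustration):

```python
def index_value(h, x, mu, sigma, beta_n, V, parent_of):
    """I_t(x_{h,i}) = Ubar_t(x_{h,i}) + V_h, where Ubar_t is the minimum of two
    upper bounds on f(x_{h,i}): one from the node's own posterior and one
    routed through its parent's cell."""
    u_self = mu(x) + beta_n * sigma(x)                      # direct UCB at x
    if h > 0:
        px = parent_of(x)
        u_parent = mu(px) + beta_n * sigma(px) + V(h - 1)   # via the parent
        ubar = min(u_self, u_parent)
    else:
        ubar = u_self
    return ubar + V(h)    # V_h covers the variation of f inside the cell

# Toy check with a flat prior (mu = 0, sigma = 1) and V_h = 0.5^h.
val = index_value(h=1, x=0.3,
                  mu=lambda x: 0.0, sigma=lambda x: 1.0,
                  beta_n=1.0, V=lambda h: 0.5**h,
                  parent_of=lambda x: 0.0)
```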

Algorithm 1: Tree based Algorithm for GP bandits
Input: budget $n$; parameters $\beta_n$, $h_{max}$, $(V_h)_{h \geq 0}$; a tree of partitions of $\mathcal{X}$
1 Initialize: $t \leftarrow 1$, $n_e \leftarrow 0$, $L_t \leftarrow \{x_{0,1}\}$; while $n_e < n$ do
2   select $x_{h_t,i_t} \in \arg\max_{x_{h,i} \in L_t} I_t(x_{h,i})$
3   if $\beta_n \sigma_{t-1}(x_{h_t,i_t}) \leq V_{h_t}$ and $h_t < h_{max}$ then
4     add the children $\{x_{h_t+1,j} : N(i_t - 1) + 1 \leq j \leq N i_t\}$ to $L_t$
5     remove $x_{h_t,i_t}$ from $L_t$
6   else
7     observe $y_t = f(x_{h_t,i_t}) + \eta_t$
8     update posterior $\mu_t(x)$ and $\sigma_t(x)$
9     $n_e \leftarrow n_e + 1$
10  end
11  $t \leftarrow t + 1$
12 end
Output: $x(n)$: the deepest expanded node
Remark 4. The parameter $\beta_n$ of Algorithm 1 requires the knowledge of the horizon or budget $n$. However, we can use the well known doubling trick (Cesa-Bianchi and Lugosi, 2006, Section 2.3) to make our algorithm anytime without any change in the theoretical regret guarantees. The trick is to work in phases of exponentially increasing lengths, applying the algorithm with known horizon (equal to the duration of the phase) in each phase.
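The doubling trick of Remark 4 in code, with the inner fixed-horizon algorithm abstracted as an opaque callable:

```python
def doubling_trick(fixed_horizon_alg, total_rounds):
    """Anytime wrapper: run the known-horizon algorithm in phases of lengths
    1, 2, 4, ..., restarting it with the phase length as its budget."""
    done, phases, length = 0, [], 1
    while done < total_rounds:
        budget = min(length, total_rounds - done)
        fixed_horizon_alg(budget)     # fresh run with known horizon `budget`
        phases.append(budget)
        done += budget
        length *= 2
    return phases

calls = doubling_trick(lambda n: None, 20)
# phases: 1, 2, 4, 8, 5 (the last phase is truncated to fit 20 total rounds)
```

Since the regret of each phase is bounded by the known-horizon guarantee and phase lengths grow geometrically, the total regret is only a constant factor worse.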

Analysis of Algorithm 1
In this section, we first specify the assumptions on the covariance functions required for the theoretical analysis and then furnish the missing details of our tree based algorithm for GP bandits. Finally, we derive high probability bounds on the cumulative and simple regret for our algorithm.

Assumptions on the covariance functions
To analyze our algorithm, we will restrict our attention to a class of covariance functions, denoted by K, such that for any K ∈ K, we have:

A1
For any x, y ∈ X , we have d(x, y) ≤ g(l(x, y)) for some non-decreasing continuous function g : R + → R + , such that g(0) = 0. Recall that l is assumed to be any metric on the space X , and d is the natural metric induced on X by the zero mean GP with covariance function K.
A2 Moreover, we require that there exists a $\delta_K > 0$ such that for all $r \leq \delta_K$ we have $g(r) \leq C_K r^{\alpha}$ for constants $C_K > 0$ and $0 < \alpha \leq 1$. Assumption A2 informally requires that, at least for small distances, points which are close in the metric $l$ are also close in $d$. These assumptions are satisfied by all the commonly used kernels such as the squared exponential (SE) and the Matérn family of kernels. The class $\mathcal{K}$ also includes other kernels such as $K(r) = \max(0, 1 - r)$ and the rational quadratic kernel $K(r) = (1 + c_1 r^2)^{-c_2}$ for some $c_1, c_2 > 0$.
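For the squared exponential kernel one admissible choice is $g(r) = r/\ell$ with $\alpha = 1$, since $d(x,y)^2 = 2(1 - e^{-r^2/(2\ell^2)}) \leq r^2/\ell^2$ by $1 - e^{-u} \leq u$; the snippet below is a numerical sanity check of this bound with an illustrative lengthscale.

```python
import numpy as np

def d_induced(r, length=0.3):
    """Canonical GP metric for the SE kernel: d^2 = 2 (1 - K(r))."""
    return np.sqrt(2 * (1 - np.exp(-r**2 / (2 * length**2))))

# Check d(x, y) <= g(l(x, y)) with g(r) = r / length, i.e. alpha = 1,
# C_K = 1 / length, over a range of small distances r.
rs = np.linspace(1e-4, 0.5, 1000)
assert np.all(d_induced(rs) <= rs / 0.3 + 1e-12)
```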
Remark 5. We note that $\mathcal{K}$ is closed under finite addition and multiplication operations. This is an important property, as in many practical applications more than one kernel is often combined through addition or multiplication to provide more accurate models (Duvenaud, 2014, Chapter 2), (Rasmussen and Williams, 2006).
Remark 6. Under assumptions A1 and A2, the metric space $(\mathcal{X}, d)$ (with $d$ defined in (8)) has a finite metric dimension $\tilde{D}_1 = D_1/\alpha$. This fact is used in Proposition 1 in Section 6.

Details of the algorithm
To complete the description of Algorithm 1, we need to specify the choice of the parameters h max , β n , and (V h ) h≥0 .
First we observe that for all $t$, we have $|L_t| \leq M(\mathcal{X}, v_2\rho^{h_{max}}, l)$. This follows from property P3 in Definition 4. From the definition of metric dimension, we can upper bound $M(\mathcal{X}, v_2\rho^{h_{max}}, l)$ by $C\rho^{-D_1 h_{max}}$. As will be evident in the proof of Theorem 1, an appropriate choice of the parameter $h_{max}$ is given in (17).
Claim 1. With $\beta_n = O(\sqrt{\log(n) + u})$, the following event $\Omega_{u5}$ occurs with probability at least $1 - e^{-u}$ for any $u > 0$: for all $1 \leq t \leq t_n$ and all $x_{h,i} \in L_t$, we have $|f(x_{h,i}) - \mu_{t-1}(x_{h,i})| \leq \beta_n \sigma_{t-1}(x_{h,i})$ (18), where $t_n$ is the (random) number of rounds required by the algorithm to complete $n$ function evaluations.
Proof. The largest value that the random variable $t_n$ can take is $h_{max} n$, and for any $t \leq t_n$ we have $|L_t| \leq C\rho^{-D_1 h_{max}}$. Based on these two observations, a union bound over all rounds and all leaf nodes yields the claim. Finally, we get the required bound by selecting $\beta_n^2 = O\big(u + 2\log(h_{max} n) + D_1 h_{max} \log(1/\rho)\big)$.
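The union bound can be sketched explicitly; conditioned on the past, $f(x_{h,i}) - \mu_{t-1}(x_{h,i})$ is Gaussian with standard deviation $\sigma_{t-1}(x_{h,i})$, so the standard tail bound $\Pr(|f(x) - \mu_{t-1}(x)| > \beta\sigma_{t-1}(x)) \leq 2e^{-\beta^2/2}$ applies to each term:

```latex
\Pr\Big(\exists\, t \le t_n,\ \exists\, x_{h,i} \in L_t:\
      |f(x_{h,i}) - \mu_{t-1}(x_{h,i})| > \beta_n \sigma_{t-1}(x_{h,i})\Big)
  \;\le\; \sum_{t=1}^{h_{\max} n} \sum_{x_{h,i} \in L_t} 2 e^{-\beta_n^2/2}
  \;\le\; 2\, h_{\max} n\, C \rho^{-D_1 h_{\max}}\, e^{-\beta_n^2/2}.
```

Setting the right-hand side to $e^{-u}$ and solving for $\beta_n$ gives $\beta_n^2 = 2u + 2\log(2C\, h_{\max} n) + 2 D_1 h_{\max} \log(1/\rho)$, which matches the order stated in the proof.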
Remark 7. The calculation of $\beta_n$ above is based on the worst case assumption that $|L_t| = M(\mathcal{X}, v_2\rho^{h_{max}}, l)$. In the case of $\mathcal{X} \subset \mathbb{R}^D$ and for odd values of $N$, we can use the tighter bound $|L_t| \leq nN h_{max}$, which gives us $\beta_n = O(\sqrt{u + \log(n h_{max})})$ and allows us to consider larger values of $h_{max}$.
Next, we obtain the expressions for the parameters $(V_h)_{h \geq 0}$ as an immediate consequence of Corollary 1.
Claim 2. Suppose the metric space $(\mathcal{X}, l)$ is well-behaved in the sense of Definition 4 with subsets $(\mathcal{X}_h)_{h \geq 0}$ and associated parameters $v_1$, $v_2$ and $\rho$. Let $\Omega_{u6}$ denote the event that $\sup_{x \in X_{h,i}} |f(x) - f(x_{h,i})| \leq V_h$ for all $h \geq 0$ and $1 \leq i \leq N^h$ (19). Then, for the choice of $V_h$ implied by Corollary 1, we have $\Pr(\Omega_{u6}) \geq 1 - e^{-u}$ for any $u > 0$. Here $C_3$ and $C_4$ are the positive constants defined in Corollary 1.

Regret Bounds
Before presenting the regret bounds, we first characterize the sub-optimality as well as the number of times points are evaluated by Algorithm 1.
Lemma 1. Under the events $\Omega_{u5}$ (18) and $\Omega_{u6}$ (19), the following statements are true:
• If at time $t$ a point $x_{h_t,i_t}$ is evaluated by the algorithm, then the suboptimality of the selected point, $\Delta(x_{h_t,i_t}) := f(x^*) - f(x_{h_t,i_t})$, can be upper bounded by a constant multiple of $V_{h_t}$ (20).
• Furthermore, if the evaluated point $x_{h_t,i_t}$ satisfies the condition $h_t < h_{max}$, then we have another bound on $\Delta(x_{h_t,i_t})$ in terms of the posterior variance: $\Delta(x_{h_t,i_t}) = O\big(\beta_n \sigma_{t-1}(x_{h_t,i_t})\big)$ (21).
• A point $x_{h,i}$, with $h < h_{max}$, may be evaluated no more than $q_h$ times before it is expanded, where $q_h$ (22) is obtained using the assumptions on the covariance function $K$.
Proof. We recall that under the event $\Omega_{u5}$ we have $|f(x_{h,i}) - \mu_{t-1}(x_{h,i})| \leq \beta_n \sigma_{t-1}(x_{h,i})$ for all $x_{h,i} \in L_t$ and for all $t \geq 1$. Furthermore, from the definition of the event $\Omega_{u6}$, for all $h \geq 0$ and $1 \leq i \leq N^h$ the variation of $f$ within the cell $X_{h,i}$ is at most $V_h$. Using these two facts we can prove the first part of this lemma in the following way:
• Suppose at time $t$ the true maximizer $x^*$ lies in the cell $X_{h_t^*,i_t^*}$ associated with the point $x_{h_t^*,i_t^*}$, and the algorithm selects and evaluates the point $x_{h_t,i_t}$. Then we have a sequence of inequalities, labeled (a)-(e), chaining $f(x^*) \leq I_t(x_{h_t^*,i_t^*}) \leq I_t(x_{h_t,i_t})$ down to $f(x_{h_t,i_t})$ plus terms of order $V_{h_t-1}$ and $V_{h_t}$. Along the way, we use the fact that $p(x_{h_t,i_t})$ must have been expanded, which means $\beta_n \sigma_{t-1}(p(x_{h_t,i_t}))$ must be smaller than $V_{h_t-1}$. For inequality (d) we observe that $x_{h_t,i_t}$ must lie in the cell associated with $p(x_{h_t,i_t})$ and then use the definition of $V_{h_t-1}$, while (e) follows from the triangle inequality.
• For obtaining the bound in (21), we again use the definition of $\bar{U}_t(x_{h_t,i_t})$, now upper bounding it by the other term in its definition, to get (21). The inequality (f) used there relies on the fact that, since the function is evaluated at time $t$, we must have $\beta_n \sigma_{t-1}(x_{h_t,i_t}) \geq V_{h_t}$.
• A point $x_{h,i}$ must be evaluated by the algorithm sufficiently many times to reduce the uncertainty in the function value at $x_{h,i}$ from below $V_{h-1}$ to below $V_h$. We provide a loose upper bound $q_h$ on this quantity by bounding the number of function evaluations sufficient to reduce the uncertainty in the value of $f(x_{h,i})$ to below $V_h$; $q_h$ is then defined using the first part of Proposition 3 to get the required result.
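The flavor of the count $q_h$ can be illustrated by ignoring the help from neighboring evaluations and repeatedly querying a single point, for which the posterior variance after $k$ noisy observations is $\sigma_k^2 = \sigma_0^2\sigma^2/(\sigma^2 + k\sigma_0^2)$; the target and variances below are hypothetical numbers, not the paper's constants.

```python
import numpy as np

def evals_to_reach(target_std, prior_var=1.0, noise_var=0.01):
    """Number of repeated noisy evaluations at one point needed to drive its
    posterior std below target_std.  Ignoring neighboring evaluations makes
    this an over-count, matching the 'loose upper bound' in the text."""
    k, var = 0, prior_var
    while np.sqrt(var) > target_std:
        k += 1
        # conditioning on k i.i.d. observations of f(x) with noise variance s^2:
        # var_k = s0^2 s^2 / (s^2 + k s0^2)
        var = prior_var * noise_var / (noise_var + k * prior_var)
    return k

# A q_h-style count: to certify beta_n * sigma_t <= V_h one may take
# target_std = V_h / beta_n (illustrative values below).
q = evals_to_reach(target_std=0.05, prior_var=1.0, noise_var=0.01)
```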
Remark 8. From Lemma 1, we can see that the algorithm only selects points lying in near-optimal regions of $\mathcal{X}$, whose size is measured by the term $D_f(\cdot, \cdot)$ introduced in Definition 3. We will use this term, denoted by $\tilde{D}$, for presenting our regret bounds, and will refer to it as the near-optimality dimension of $\mathcal{X}$ associated with the function $f$.
We can now state the main result of this section which gives us high probability bounds on the cumulative as well as simple regret of Algorithm 1.
Theorem 1. Suppose the unknown function $f$ is a sample from a $GP(0, K)$ with $K \in \mathcal{K}$, and $\mathcal{X}$ is a well-behaved metric space (in the sense of Definition 4) with finite metric dimension $D_1$ (see Definition 2).
For any u > 0, the following bounds are true with probability at least 1 − 2e^{−u} for Algorithm 1: • The cumulative regret incurred by Algorithm 1 satisfies the bound in (33), where D̃ (described in Remark 8) is a non-negative random variable always less than or equal to D_1.
• Furthermore, if we make the assumption that K(x, x) ≤ 1 for all x ∈ X, we have the information-type bound (34) on the cumulative regret. • Finally, we also have the upper bound (35) on the simple regret. The proof of this result is given in Appendix C.
Remark 9. The bounds in (33) and (35), which depend on the near-optimality dimension, will be referred to as dimension-type regret bounds in accordance with the terminology used by Slivkins (2014). We note that since the cumulative regret of the algorithm can be bounded in two ways, by taking the minimum of the bounds in (33) and (34) we get a uniformly better upper bound on the cumulative regret of our algorithm for all GPs with admissible covariance functions satisfying K(x, x) ≤ 1.

Discussion
The analysis of Algorithm 1 presented in the previous section is valid for arbitrary well-behaved search space X , any covariance function K ∈ K and in the presence of observation noise. In this section, we discuss the performance of our algorithm under some specific problem instances. In particular, we first show that our adaptive approach leads to computational requirements which do not explode with the dimension D when X ⊂ R D , unlike the existing algorithms for GP bandits. We then validate the intuition provided by our toy examples in Section 1.3 by showing that the information-type bounds are indeed loose for an important family of Gaussian Processes. Finally, we specialize our results to the noiseless case, and show that our algorithm compares favorably with BaMSOO in the pure exploration problem.

Computational benefits of adaptivity
As an upshot of the adaptive discretization of the search space, the computational complexity of Algorithm 1 does not grow exponentially with the dimension of the search space, as shown in the following result. Proof. Recall that the search space considered here is well-behaved in the sense of Definition 4, and has a finite metric dimension D_1 = D. Furthermore, since N is odd, we observe that the sequence of partitions (X_h)_{h≥0} is nested. More specifically, if the cell associated with a node x_{h,i} is refined to add the nodes {x_{h+1,j}; (N − 1)i + 1 ≤ j ≤ Ni} to the leaf set, then we have x_{h+1,(N−1)i+(N+1)/2} = x_{h,i}. Let t_n denote the number of rounds required for n function evaluations by the algorithm, and let (τ_j)_{j=1}^n denote the round numbers in which function evaluations are performed. Now, if we define τ_0 = 1, then we claim the following: • The posterior distribution is recomputed in rounds (τ_j + 1)_{j=0}^n based on the observations. The computational task of updating the posterior based on j observations in round τ_j + 1 can be performed in O(j²) operations by using the Cholesky decomposition. Thus the total cost of posterior computation is O(n³).
• For all t such that τ_j + 1 < t ≤ τ_{j+1}, the index I_t at a given point can be computed in O(j²) operations. Since every refinement step adds N − 1 new points to the leaf set and τ_{j+1} − τ_j ≤ h_max for all j ≥ 0, the total cost of computing the index in this time interval is O((N − 1)h_max j²). For t ∈ {τ_j + 1; 0 ≤ j ≤ n}, the index must be recomputed for the entire leaf set L_t, whose cardinality is upper bounded by (N − 1)h_max j, and thus the computational cost of this step is O((N − 1)h_max j³). Thus the total cost of computing the index I_t for all t ≤ t_n is O((N − 1)h_max n⁴).
• For selecting the candidate points x_{h_t,i_t} for t ∈ {τ_j + 1; 0 ≤ j < n}, we need to perform an exhaustive search over the entire leaf set L_t, which is an O((N − 1)h_max j) operation. At all other times, we only need to search over the (N − 1) new descendants of the previous candidate point. Thus the total cost of selecting candidate points is O((N − 1)h_max n²). • As mentioned earlier, the refinement of a cell X_{h,i} when X ⊂ R^D is performed by dividing it equally in N parts along its longest side. This requires O(DN) operations, so the total cost of refining the search space is O(h_max nDN).
Thus the overall computational cost of running the algorithm with a budget of n function evaluations for fixed D and N is O(h max n 4 ), which is equal to O(n 4 log n) using the constraint on h max given in (17).
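To make the per-round costs above concrete, here is a minimal sketch of the posterior computation, not the paper's implementation: the squared-exponential kernel, lengthscale, and noise level are illustrative assumptions. The Cholesky factor is formed once per batch of j observations (O(j³)), after which the posterior (and hence an index) at each query point costs O(j²):

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def posterior(x_obs, y_obs, x_query, noise=0.1):
    """GP posterior mean/std at x_query given j = len(x_obs) observations.
    The Cholesky factorization costs O(j^3) once per batch; evaluating the
    posterior at a single query point then costs O(j^2)."""
    K = rbf(x_obs, x_obs) + noise ** 2 * np.eye(len(x_obs))
    L = np.linalg.cholesky(K)                       # done once per batch
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    k_star = rbf(x_obs, x_query)
    mean = k_star.T @ alpha                         # O(j) per query point
    v = np.linalg.solve(L, k_star)                  # O(j^2) per query point
    var = 1.0 - np.sum(v ** 2, axis=0)              # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 8)
y = np.sin(6.0 * x) + 0.1 * rng.standard_normal(8)
mu, sd = posterior(x, y, np.linspace(0.0, 1.0, 5))
```

Since the adaptive algorithm only evaluates the index on the current leaf set rather than a uniform grid, the O(j²)-per-point cost is paid for far fewer points per round.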
As shown above, the computational complexity of our algorithm scales linearly with the dimension of the search space. This is in contrast to the existing algorithms for GP bandits, which perform a global maximization of an acquisition function ψ_t(·) for selecting a query point; the computational cost of performing this operation exactly can be exponential in D. For example, in the GP-UCB algorithm the acquisition function is the upper confidence bound at each point x ∈ X. Over a search space X ⊂ R^D, for the theoretical results to be valid, the GP-UCB algorithm must select a query point at time t by calculating and then maximizing the UCB over a uniform grid of size O(t^{2D}). Thus the overall computational cost of running this algorithm for n rounds is O(∑_{t=1}^{n} t^{2D+2}) = O(n^{2D+3}).
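The contrast in candidate-set sizes can be illustrated with a back-of-the-envelope count; the values of N and h_max below are illustrative assumptions, and constant factors are dropped:

```python
def uniform_grid_candidates(t, D):
    """Number of grid points GP-UCB must scan at time t when using a
    uniform grid of size O(t^(2D)) in D dimensions."""
    return t ** (2 * D)

def adaptive_leaf_candidates(t, N=3, h_max=40):
    """Upper bound on the leaf-set size of the adaptive algorithm after t
    evaluations: each refinement adds at most N - 1 points, and at most
    h_max refinement rounds separate consecutive evaluations."""
    return (N - 1) * h_max * t

for D in (1, 2, 4):
    print(D, uniform_grid_candidates(100, D), adaptive_leaf_candidates(100))
```

At t = 100 the uniform grid already holds 10^16 points for D = 4, while the adaptive leaf-set bound is independent of D.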

Improved bounds for Matérn kernels
Matérn kernels are a widely used class of kernels parameterized by a smoothness parameter ν. For half-integer values ν = m + 1/2, the Matérn kernels can be written in the form K(r) = e^{−cr}(1 + p_m(r)) for some c > 0, where p_m(r) = ∑_{i=1}^{m} a_i r^i with a_i > 0 for all 1 ≤ i ≤ m. Thus, for any x, y ∈ X such that l(x, y) = r, the GP-induced metric d(x, y) scales as r^α. It is easy to check that for ν = 1/2 we have α = 1/2, and for all other half-integer values of ν we have α = 1. So, for Matérn kernels, our algorithm has a dimension-type upper bound on regret of the form Õ(n^{(D+α)/(D+2α)}) for all ν = m + 1/2 with m ≥ 0 and α ∈ {1/2, 1}. This improves upon the existing upper bounds for Matérn kernels in the following two ways (since the existing bounds hold only when X ⊂ R^D, we restrict our comparison to this case, so that D_1 = D here): • The existing regret bounds are only valid for the case of ν > 1 (Contal and Vayatis, 2016), whereas the dimension-type regret bounds of our algorithm are valid for all ν ≥ 1/2. In particular, for the exponential kernel (ν = 1/2, also referred to as the Ornstein-Uhlenbeck process), Srinivas et al. (2012) conjectured that it may not be possible to derive a regret bound of the form shown in (4). This conjecture was refuted by Contal and Vayatis (2016), but the authors did not provide an explicit characterization of R_n, as no suitable bounds on γ_n for this kernel are known. Our result provides an upper bound on the cumulative regret for the exponential kernel of the form R_n ≤ Õ(n^{(2D+1)/(2D+2)}), which is, to the best of our knowledge, the first explicit sublinear bound on the cumulative regret for the GP bandits problem with the exponential kernel.
• The existing regret bounds for Matérn kernels have the form Õ(n^{(D(D+1)+ν)/(D(D+1)+2ν)}) (Contal and Vayatis, 2016) for ν > 1. As compared to this, the bounds obtained by our algorithm, after substituting α = 1 for Matérn kernels with ν > 1, depend upon D̃, which itself is a random variable dependent on the sample function f of the Gaussian Process and can take values anywhere from 0 to D. Assuming the worst-case value of D̃ = D, we observe that for D ≥ ν − 1 we have (D+1)/(D+2) ≤ (D(D+1)+ν)/(D(D+1)+2ν). Thus D ≥ ν − 1 is a sufficient condition for our upper bounds to be tighter than the best known bounds for Matérn kernels. The two most commonly used Matérn kernels in machine learning correspond to ν = 3/2 and ν = 5/2 (Rasmussen and Williams, 2006, Chapter 4), for which the sufficient condition reduces to D ≥ 1 and D ≥ 2 respectively.
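The half-integer Matérn forms referenced in this comparison can be written out explicitly. The sketch below (unit lengthscale assumed) uses the standard closed forms from Rasmussen and Williams (2006, Chapter 4) to show the e^{−cr}(1 + p_m(r)) structure for ν ∈ {1/2, 3/2, 5/2}:

```python
import numpy as np

def matern_half_integer(r, nu):
    """Matern kernel (unit lengthscale) for half-integer nu, written in
    the exp(-c r) * (1 + p_m(r)) form discussed in the text; closed forms
    from Rasmussen and Williams (2006, Chapter 4)."""
    r = np.asarray(r, dtype=float)
    if nu == 0.5:                    # exponential / Ornstein-Uhlenbeck
        return np.exp(-r)
    if nu == 1.5:
        s = np.sqrt(3.0) * r
        return (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5.0) * r
        return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} are implemented here")
```

For ν = 1/2 the polynomial part is empty (m = 0), which is the source of the exponent α = 1/2 above; for ν = 3/2 and 5/2 the leading-order behavior near r = 0 is quadratic, giving α = 1.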

Regret under noiseless observations
In this section, we consider the special case where there is no observation noise, and specialize the regret bounds of our algorithm to this setting. In particular, we have the following bounds: Claim 4. If in addition to the assumptions of Theorem 1 we further assume that the observations are noiseless, i.e., σ = 0, then with high probability we get the stated bounds if D̃ > 0, and S_n ≤ Õ(e^{−c_1 log(1/ρ)n}) if D̃ = 0 and h_max = Ω(n), for some constant c_1 > 0.
Remark 10. We note that unlike Theorem 1, we do not present information-type bounds on the cumulative regret in Claim 4. This is because the information-type bounds given by (4) are not directly applicable in the noiseless setting as the term γ n becomes undefined for σ = 0.
As mentioned earlier, our work is motivated by BaMSOO, an adaptive algorithm for the Bayesian optimization problem which works only with noiseless observations (Wang et al., 2014). BaMSOO builds upon the Simultaneous Optimistic Optimization (SOO) algorithm of Munos (2011) by making the further assumption that the unknown function is a sample from a GP, and then utilizes the posterior confidence intervals in the selection of points. Wang et al. (2014) obtained an upper bound on the simple regret of the order Õ(n^{−c/D}) for some c > 0, which is similar to our simple regret bound in Claim 4. However, our approach extracts more information about the function from the GP prior and has some advantages over BaMSOO in the pure exploration setting. In particular, the derivation of regret bounds for BaMSOO required the assumption (Wang et al., 2014, Assumption 2) that the unknown function is approximately quadratic in the region around the maximum x*, which for example is ensured if the covariance function has continuous partial derivatives of order 6. Our result does not require this quadratic behavior, and is valid for kernels not satisfying the smoothness requirements, such as the exponential kernel K(r) = ce^{−c_1 r} and the kernel K(r) = (1 − r)_+. Furthermore, if for some instances of the function f the random variable D̃ equals zero, then we obtain an exponentially decaying simple regret bound for Algorithm 1. This is unlike the simple regret bounds for BaMSOO, which decay polynomially in n for all admissible kernels.

Extensions
In this section, we first present an algorithm for GP bandits which uses an alternative approach to locally refining the search space as compared to Algorithm 1. While Algorithm 1 requires a tree of partitions to adaptively discretize the space X , the algorithm presented in Section 5.1 instead utilizes a covering oracle to explore the search space.
Next, in Section 5.2 we apply our general approach to design an adaptive algorithm for the problem of contextual GP bandits, an extension of the usual GP bandits problem first studied in (Krause and Ong, 2011).

Bayesian Zooming Algorithm
We now present a Bayesian version of the zooming algorithm for Lipschitz optimization introduced by Kleinberg et al. (2013) and analyze its regret. In particular, instead of assuming that the metric space (X , l) is well-behaved in the sense of Definition 4, this algorithm requires access to the space (X , l) through a covering oracle (see Remark 12 for definition) to locally refine the discretization.
The algorithm proceeds by constructing an increasing sequence of active subsets of X denoted by (A t ) t≥1 . As with Algorithm 1, we can compute high probability upper and lower confidence intervals for the function values at points in A t for all t ≥ 1.
for a suitable factor β_n. Also, to each point x that has been evaluated at least once, we assign a radius denoted by r(x). The radius r(x) can take values in the set {r_k = r_0 2^{−k}; k ∈ N}, where r_0 = diam(X) is the diameter of the metric space (X, l) and is assumed to be finite. For implementing the algorithm, we further require bounds (W(r_k))_{k∈N} such that for all x ∈ X and for all k ∈ N, W(r_k) is a bound on the variation of the GP sample in the ball B(x, r_k, l) with high probability. We obtain these W(r_k) using Proposition 2. We also require a parameter r_min as input, which plays a role similar to h_max in Algorithm 1. The details behind the choice of these parameters are provided in Appendix D.
Corresponding to each point that has been evaluated at least once, we have an associated confidence region B(x, r(x), l), and furthermore we also have an upper bound (w.h.p.) on the maximum value of the function in that region, given by the index J_t(x). In each round t, a candidate point is selected in an optimistic manner from the set A_t, i.e., by maximizing this index. The index J_t(x) can take a large value if: • the point x has been evaluated very few times, in which case the uncertainty at x (β_n σ_{t−1}(x)) as well as the bound on the variation of f in the confidence region (W(r(x))) are large;
• or the point x has been observed many times and the true function value f(x) is large.
In this way the selection rule strikes a balance between exploration of poorly understood regions, and exploitation of well explored regions with high function values.
Having chosen a candidate point x_t at time t, the algorithm takes one of two actions: • Refine: If the uncertainty in the function value at x_t is smaller than the bound on the variation of the function in the confidence region associated with x_t, then the algorithm locally refines the search space, that is, it shrinks the radius of the confidence region associated with x_t by a factor of 2.
• Evaluate: Otherwise, if the uncertainty in the function value is larger than the variation in the confidence region, the function is evaluated at the candidate point x t .
In order to ensure that the entire search space is taken into consideration, the algorithm maintains at all times the invariant stated in (32). If this invariant is violated, a point from the uncovered region (i.e., X \ ∪_{x∈A_t} B(x, r(x), l)) is added to the active set of points with an associated radius r_0 = diam(X). All the steps described above are formally stated as a pseudo-code in Algorithm 2.
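The refine/evaluate rule above can be sketched as a small helper. This is a hypothetical rendering, not the pseudo-code of Algorithm 2; sigma_t, W, and r_min stand in for the posterior standard deviation, the variation bounds, and the minimum radius from the text:

```python
def zooming_step(x_t, sigma_t, radius, beta_n, W, r_min):
    """One round of the decision rule: refine (halve the confidence radius
    of x_t) when the scaled posterior uncertainty is already below the
    variation bound W(r(x_t)) over the confidence ball; otherwise
    evaluate f at the candidate point x_t."""
    if beta_n * sigma_t(x_t) <= W(radius[x_t]) and radius[x_t] / 2 >= r_min:
        radius[x_t] /= 2
        return "refine"
    return "evaluate"

# Toy usage with stand-in uncertainty and variation-bound functions.
radius = {"x0": 1.0}
W = lambda r: r                      # e.g. a Lipschitz-style bound
action = zooming_step("x0", lambda x: 0.01, radius, 2.0, W, 1e-3)
```

A point with low uncertainty triggers a refinement (its ball shrinks), while a poorly understood point triggers a function evaluation, mirroring the two bullet points above.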
Algorithm 2: Zooming Algorithm for GP bandits. Input: n > 0, (r_k)_{k≥0}, (W(r_k))_{k≥0}, r_min. Initialize: t = 1, n_e = 0. Output: x(n), the point with the smallest radius. Remark 11. A key difference between Algorithm 2 and the zooming algorithm for Lipschitz functions is that our algorithm only evaluates a point if the confidence radius associated with it is small enough (Lines 3-4 of Algorithm 2). This is unlike the zooming algorithm in (Kleinberg et al., 2013), in which a point is evaluated in every round. This modification is necessary to obtain the information-type bounds on the cumulative regret for our algorithm.
Remark 12. For maintaining the invariant described in (32) and in Lines 10-12 of Algorithm 2, we assume the existence of a covering oracle (Kleinberg et al., 2013, Section 1.5), which takes as input a finite set of balls and outputs whether these balls cover the entire space X or not. In the latter case, the covering oracle also returns an arbitrary point from the uncovered region of X. In our case, suppose that at the beginning of round t the entire space is covered by the balls (this is true at t = 2), and that a point x is selected by the algorithm and its confidence radius is shrunk from r(x) to r(x)/2. Then at the beginning of the next round, we only need to check whether the annular region B(x, r(x), l) \ B(x, r(x)/2, l) is fully covered by the other balls or not.
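For a concrete search space, the covering oracle can be approximated as follows; the finite grid standing in for X = [0, 1] is an illustrative assumption (an exact oracle would certify coverage analytically):

```python
import numpy as np

def covering_oracle(centers, radii, grid):
    """Approximate covering oracle: report whether the balls
    B(center_i, radius_i) cover every point of a finite grid standing in
    for X; if not, return one uncovered point (cf. Kleinberg et al.)."""
    centers = np.asarray(centers, dtype=float)[:, None]
    radii = np.asarray(radii, dtype=float)[:, None]
    covered = (np.abs(grid[None, :] - centers) <= radii).any(axis=0)
    if covered.all():
        return True, None
    return False, float(grid[np.argmax(~covered)])

grid = np.linspace(0, 1, 101)
ok, pt = covering_oracle([0.25, 0.75], [0.3, 0.3], grid)       # covered
ok2, pt2 = covering_oracle([0.25, 0.75], [0.25, 0.2], grid)    # gap near 0.5
```

When a gap is found, the returned point would be added to the active set with radius r_0 = diam(X), restoring the covering invariant.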
Our next result shows that we can obtain the same regret performance for Algorithm 2 as we did for the tree-based algorithm.
Theorem 2. Suppose the unknown function f is a sample from a GP (0, K), with K ∈ K. (X , l) is assumed to be a compact metric space with finite metric dimension D 1 (see Definition 3). Moreover, we assume that we can access the metric space (X , l) through a covering oracle.
Then, for any u > 0, the following bounds are true with probability at least 1 − 2e −u for Algorithm 2: • We have the following dimension-type bound on the cumulative regret.
where D̃_Z is the near-optimality dimension defined in Remark 13. • Under the extra assumption that K(x, x) ≤ 1 for all x ∈ X, we also have an information-type bound on the cumulative regret. • Finally, we also have an upper bound on the simple regret. The details of the choice of the parameters of Algorithm 2, as well as an outline of the proof of Theorem 2, are provided in Appendix D.
Remark 13. The near-optimality dimension D̃_Z used in the statement of Theorem 2 can be defined similarly to the definition of D̃ introduced in Remark 8. More specifically, by Lemma 2 in Appendix D we know that Algorithm 2 only selects evaluation points from sets of the stated form, so we can proceed as in Remark 8 to define D̃_Z = D_f(δ_K, ζ_K), with ζ_K(z) := 5W(r_{k_z}) where k_z := min{k ≥ 0 : r_k ≤ z}.

Extension to Contextual GP bandits
The contextual bandit problem is a generalization of the multi-armed bandit (MAB) problem in which, at the beginning of each round, the agent receives a context, and the task is to select an action which is optimal for the context received. Krause and Ong (2011) considered this problem in the Bayesian framework with a GP prior and proposed the CGP-UCB algorithm, a variant of the GP-UCB algorithm. They obtained information-type regret bounds on the contextual regret for CGP-UCB and, additionally, provided bounds on the maximum information gain (γ_n) for composite kernels over the product space. This problem has also been studied in the non-Bayesian setting by imposing a Lipschitz condition on the payoff functions (Slivkins, 2014).
For this problem, the set X is a product of two sets, the context set X_c and the action set X_a, and f : X → R is the mean reward observed for a context-action pair. As before, we will assume that the unknown function f is a sample from a Gaussian process GP(0, K), now indexed by the product set X = X_c × X_a. In each round τ, the agent receives a context x^c_τ ∈ X_c, must select an action x^a_τ ∈ X_a corresponding to that context, and observes the reward y_τ = f(x_τ) + η_τ, where x_τ = (x^c_τ, x^a_τ) ∈ X. The goal of the agent is to design a strategy of selecting actions to minimize the contextual cumulative regret R^c_n = ∑_{τ=1}^{n} ( sup_{x^a ∈ X_a} f(x^c_τ, x^a) − f(x^c_τ, x^a_τ) ).

Tree based algorithm for Contextual GP bandits
We again make the assumption that the space X admits a tree of partitions satisfying the properties described in Definition 4. To simplify the description of the algorithm, we will assume that the metric space admits a binary tree of partitions (i.e., N = 2). We show that with a small modification to the point selection rule and the cell expansion strategy, we can easily adapt Algorithm 1 to the problem of contextual GP bandits. We need to introduce a couple of definitions in order to describe the algorithm. We call a cell X_{h,i} active with respect to a context x^c ∈ X_c if there exists an action x^a ∈ X_a such that (x^c, x^a) ∈ X_{h,i}. Now, given a context x^c, for every active cell X_{h,i} (corresponding to a point x_{h,i} ∈ L_t) we find a point of the form (x^c, x^a_{h,i}) ∈ X_{h,i}, and we will refer to the collection of these points as the leaf set relevant to the context x^c, denoted by L^{rel}_t. Suppose a cell X_{h,i} with 0 < h < h_max is expanded by the algorithm at time t_0. Then for all t ≥ t_0, we use x̃^{(t)}_{h,i} to denote the candidate point in the cell X_{h,i} which was chosen by the algorithm at time t_0. This point has the property that β_n σ_{t−1}(x̃^{(t)}_{h,i}) ≤ V_h. Clearly, this property is true at time t = t_0 (by Line 6 of Algorithm 3). Furthermore, since the posterior variance at a point cannot increase as more observations are made, the inequality holds for all t > t_0 as well.
For all points in L^{rel}_t, we define an index I^c_t. The rest of the algorithm proceeds in a manner similar to Algorithm 1. We select a candidate point by maximizing the index I^c_t over the relevant leaf set L^{rel}_t. Having selected the candidate point, we either evaluate the function or refine the discretization, depending on the uncertainty in the function value at the chosen point.
The steps of the algorithm are shown as a pseudo-code in Algorithm 3. The values of the parameters h max , β n and (V h ) h≥0 used here are the same as those used in the algorithms for GP bandits, with the modification that n now represents the total number of context arrivals and X = X c × X a .
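The context-restricted selection step can be sketched as below. This is a hypothetical simplification: the index here is a plain UCB, whereas the index I^c_t in the text also incorporates the parent-cell quantities x̃^{(t)}_{h,i}; the interval representation of cells is likewise an illustrative assumption:

```python
def select_action(context, leaf_cells, mu, sigma, beta_n):
    """Restrict the leaf set to the cells that are active for the given
    context (the 'relevant leaf set'), pick one representative action per
    cell, and maximize a UCB-style index over them."""
    relevant = [(lo_a, hi_a) for (lo_c, hi_c, lo_a, hi_a) in leaf_cells
                if lo_c <= context <= hi_c]          # active cells only
    reps = [0.5 * (lo + hi) for lo, hi in relevant]  # one action per cell
    # Optimistic selection over the relevant leaf set.
    return max(reps, key=lambda a: mu(context, a) + beta_n * sigma(context, a))

# Toy usage: cells are (lo_c, hi_c, lo_a, hi_a) boxes in context x action space.
cells = [(0, 1, 0, 0.5), (0, 0.5, 0.5, 1), (0.5, 1, 0.5, 1)]
best = select_action(0.25, cells, mu=lambda c, a: a,
                     sigma=lambda c, a: 0.0, beta_n=1.0)
```

Only cells containing the arriving context are searched, so the per-round cost again depends on the leaf-set size rather than on a grid over X_a.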

Algorithm 3: Tree based Algorithm for Contextual GP bandits
Input

Bounds on contextual regret
For the algorithm for contextual GP bandits described above, we now present high probability bounds on the contextual regret R^c_n. Theorem 3. Suppose Algorithm 3 is applied to a contextual GP bandits problem where the reward function f is a sample from a zero mean GP with covariance function K ∈ K, and furthermore K is assumed to be isotropic.³ The product space X = X_c × X_a is assumed to be well-behaved (Definition 4) with finite metric dimension D_1. Then, after observing n contexts, we have for any u > 0, with probability at least 1 − 2e^{−u}, the stated dimension-type bound. In addition, if we further assume that K(x, x) ≤ 1 for all x ∈ X, then we also have an information-type bound on the contextual regret. The proof of the above result essentially follows the same arguments used in the proof of Theorem 1, and we omit the details here. For deriving the dimension-type contextual regret bound, we require an intermediate lemma analogous to Lemma 1. The derivation of this result differs from Lemma 1 in the following two ways: • Unlike Algorithm 1, a single point cannot be evaluated repeatedly in the contextual case, as the contexts are not chosen by the algorithm. Thus, to bound the term q_h here, we need to upper bound the posterior variance at a point given a certain number of function evaluations at points in a ball B(x, r, l). For this we use the result in the second part of Proposition 3.
• The definition of x̃^{(t)}_{h,i} introduced earlier is crucial in obtaining a bound on the sub-optimality of the chosen action analogous to that in (20). Suppose, in response to a context x^c_τ, the algorithm selects an action x^a_t at level h_t of the tree, and let x*_τ := (x^c_τ, arg sup_{x^a ∈ X_a} f(x^c_τ, x^a)) and x_{h_t,i_t} = (x^c_τ, x^a_t) (note that τ is the index of the context, i.e., 1 ≤ τ ≤ n, and t is the index of the round, i.e., 1 ≤ t ≤ n h_max, in Algorithm 3). We then proceed with a chain of inequalities in which (a) uses the definition of x̃^{(t)}_{h_t−1,i_t/2} and (b) uses the fact that 2V_{h_t} ≤ V_{h_t−1}.
With these results available, the remainder of the proof of Theorem 3 mirrors the proof of Theorem 1.
Remark 14. Compared to the CGP-UCB algorithm of Krause and Ong (2011), Algorithm 3 again has two benefits. First, if X ⊂ R^D, then the computational cost of running the algorithm does not grow exponentially with the dimension of the space, unlike CGP-UCB, whose practical implementation cost increases exponentially with D. Second, as with Algorithm 1, our theoretical regret bounds are tighter for Matérn kernels when we have D ≥ ν − 1.
Remark 15. Krause and Ong (2011) considered composite covariance functions formed either by taking products K(x^c, x^a) = K_c(x^c) × K_a(x^a) or by taking sums K(x^c, x^a) = K_c(x^c) + K_a(x^a) of different covariance functions over the context space and the action space. Since our class of covariance functions K is closed under such operations, if K_c and K_a lie in K then their composition will also be in K, and thus our dimension-type bounds on the contextual regret are valid for such composite covariance functions. In addition, for the information-type bound we can use (Krause and Ong, 2011, Theorems 2 and 3) to get the required upper bound on γ_n.

Technical Results
In this section, we present some analytical results about the Gaussian Processes satisfying the assumptions described in Section 3.3.1, which were used in the design of our algorithms.
We begin by deriving a high probability bound on the maximum variation of the sample functions of a Gaussian Process within a d-ball of radius b around some fixed point x.
Proposition 1. Suppose {f(x); x ∈ X} is a separable zero mean Gaussian Process GP(0, K), and let d denote the usual metric on X induced by the GP. Let B(x_0, b, d) ⊂ X be a d-ball of radius b > 0. Then we have for any u > 0 a high probability bound on the variation, with w_b ≤ 4b(√(C_2 + 2u + 2D_1 log(1/b)) + C_3). Here C_2 and C_3 are positive constants and D_1 is the metric dimension of B(x_0, b, d) with respect to d.
The details of the proof of this statement are given in Appendix B.1. The proof uses the classical chaining technique for bounding the suprema of Gaussian Processes, and follows the same line of argument used in some existing results in the literature, such as (Contal, 2016, Theorem 3.3) and (van Handel, 2014, Theorem 5.24).
The previous result gives us a bound on the variation of the samples of a given Gaussian process within a given d-ball of radius b. Using this and the union bound, we can easily extend this to a sequence of discretizations of X : Corollary 1. Suppose {f (x); x ∈ X } is a zero mean Gaussian Process which induces the metric d on X . Let (X k ) k≥0 be a sequence of finite subsets of X , and to every point in X k we associate a radius b k with respect to the metric d. Then we have P r(Ω u ) ≥ 1 − e −u , where the event Ω u is defined as with the value of w k given by: where C 4 = C 2 + 2 log(n 2 π 2 /6), and D 1 is the metric dimension of (X , d).
Proof. The result is obtained by replacing u k ← u k + log(n 2 π 2 /6) + log(|X k |) in the proof of Proposition 1 and then taking two union bounds, one over points in X k for a fixed n and the other over all values of n ∈ N.
Specializing this result to the class of Gaussian Processes with covariance functions K ∈ K, we can obtain bounds on the variation of the GP samples in l-balls.
Corollary 2. Suppose {f(x); x ∈ X} is a Gaussian Process with covariance function K ∈ K, and let l be a metric defined on X. Then for subsets (X_k)_{k≥0} of X and the associated radius values (r_k)_{k≥0}, we have for any u > 0 that Pr(Ω_{u1}) ≥ 1 − e^{−u}, where the event Ω_{u1} is defined analogously to Ω_u with variation bounds w(r_k). This result gives us control over the variation of the Gaussian process samples in balls centered at points in (X_k)_{k≥0}. Now suppose we want to obtain high probability bounds on the variation of the GP samples in l-balls of radius (r_k)_{k≥0} for all points x ∈ X, and not just those in (X_k)_{k≥0}. Our next result shows that we can obtain this by a small modification of the previous result.
Proposition 2. For a given sequence (r_k)_{k≥0}, we have for any u > 0 that Pr(Ω_{u2}) ≥ 1 − e^{−u}, where the event Ω_{u2} is defined with variation bounds w̃_k = 2w(R_k); here w(R_k) is as defined in (45), obtained by selecting X_k to be an ε_k-cover of X (for any ε_k > 0) and choosing R_k satisfying R_k ≥ r_k + ε_k.
This result is crucial in the design of Algorithm 2 as the covering oracle can return an arbitrary point in the uncovered region of the search space X , and thus we need to bound the variation of f in ball centered at any point x ∈ X with radius r k for k ≥ 0.
Figure 2: If X_k is an ε-cover of X, then for any x ∈ X there exists a z_x ∈ X_k within distance ε of x. A ball of radius R ≥ r + ε will contain the ball B(x, r), and so twice the variation of f in B(z_x, R) (denoted by w(R)) is an upper bound on the variation of f in B(x, r). Remark 16. The result follows by an application of Corollary 2 for the given choice of R_k and ε_k.
However, the idea behind this result can be better understood through Figure 2. Let us consider a fixed radius r_k. We want a bound w̃_k such that, for all x ∈ X, we know that with high probability the variation of a Gaussian process sample within the ball B(x, r_k) is no more than w̃_k. Since the set X in general can be uncountable, we cannot directly use the union bound to get this result. However, we can get a bound in the following way: for some ε_k > 0, consider an ε_k-covering of X, denoted by X_k. For every point z ∈ X_k, we associate a ball B(z, R_k, l) with R_k ≥ r_k + ε_k and compute the corresponding variation bound w(R_k) within this ball by using Corollary 2. By the definition of X_k, for all x ∈ X there exists a z_x within ε_k distance of x, and by the choice of the radius R_k, we know that B(x, r_k, l) ⊂ B(z_x, R_k, l). Now, by the triangle inequality, we have for all y ∈ B(x, r_k, l) that |f(x) − f(y)| ≤ |f(x) − f(z_x)| + |f(y) − f(z_x)|, which gives us the required bound w̃_k ≤ 2w(R_k).
Finally, we present a result about the posterior variance at a point x at which we have multiple noisy observations. Proposition 3. Suppose the unknown function f is a sample from GP (0, K) with K ∈ K.
• Suppose a point x has been evaluated n_t(x) times before time t, according to the observation model y(x) = f(x) + η, where the noise term η is distributed according to N(0, σ²). Then we have σ_t(x)² ≤ σ²K(x, x)/(n_t(x)K(x, x) + σ²), where σ_t(x)² is the posterior variance at the point x after t observations.
• Suppose we make the further assumption that the covariance function K is isotropic, i.e., K(x, y) = K(r) where r = l(x, y). Now, if n_t(x, r) denotes the number of times a point from the ball B(x, r, l) has been evaluated up to time t, then we have an analogous bound on the posterior variance. This result allows us to estimate the number of evaluations required to bring the uncertainty about the function value at a point below a certain threshold. The first part of the above result is used in the analysis of the two algorithms proposed for GP bandits (Algorithm 1 and Algorithm 2), while the second part is used in the analysis of Algorithm 3 for contextual GP bandits (Krause and Ong, 2011).
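The first part of Proposition 3 can be sanity-checked numerically in the special case where all evaluations are at x itself: the posterior-variance formula then collapses, via the Sherman–Morrison identity, to K(x,x)σ²/(n_t(x)K(x,x) + σ²). The values below are illustrative:

```python
import numpy as np

# n repeated noisy evaluations at a single point x with prior variance k:
# the exact posterior variance equals k * sigma^2 / (n * k + sigma^2).
k, sigma2, n = 1.0, 0.25, 7
K = k * np.ones((n, n))                      # Gram matrix: n copies of x
k_vec = k * np.ones(n)                       # cross-covariances with x
post_var = k - k_vec @ np.linalg.solve(K + sigma2 * np.eye(n), k_vec)
closed_form = k * sigma2 / (n * k + sigma2)
print(post_var, closed_form)
```

The uncertainty thus decays like 1/n in the number of repeated evaluations, which is what makes the quantity q_h in Lemma 1 finite.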

Conclusion
In this paper, we considered the problem of optimizing an unknown function under noisy bandit feedback, and presented an algorithm which adaptively discretizes the search space using a hierarchical tree of partitions. We then obtained high probability bounds on the cumulative and simple regret for our algorithm. Because of adaptive refinement of the search space, our algorithms can be computationally much cheaper than the existing approaches using uniform discretizations. Furthermore, we also identified sufficient conditions under which the regret bounds of our algorithms improve upon the existing theoretical results.
Finally, we note that the tools described in Section 6, along with some stronger bounds on suprema of GPs such as those presented in (Contal and Vayatis, 2016; Van Handel, 2015), may be useful for designing adaptive algorithms for some other settings, such as the time-varying GP bandits problem (Bogunovic et al., 2016a).

A.1 Example 1
First, we note that the covariance function of the Gaussian Process is uniformly upper bounded by a_1², which implies that the information-type regret bound is valid for it. Before obtaining the lower bound on γ_n, let us select the parameters (a_i)_{i≥1} in the following way for a fixed δ > 0, where Φ(·) is the CDF of the standard normal random variable. Now, using (Srinivas et al., 2012, Lemma 5.3), we have γ_n ≥(a) n a_n²/(a_n² + σ²) = n/(1 + 8σ² log(π²n²/(3δ))), where (a) follows from the inequality log(1 + x) ≥ x/(1 + x) for x ≥ 0. From the above, we get the following bound: This implies that for all σ² > 0, the information-type regret bound for this Gaussian Process increases linearly with n. Now we show that for the given choice of parameters for this Gaussian Process, the global maximizer of the sample function f can be found from just one evaluation with high probability. Let us define the following events: E_1 = {|a_1 X_1| ≥ 1}, E_2 = {∀i ≥ 2 : |a_i X_i| ≤ 1/2} and E_3 = {|η_1| ≤ 1/2}, where η_1 is the observation noise at time t = 1. Then we have Pr(E_1 ∩ E_2 ∩ E_3) ≥ 1 − 3δ, and it is easy to see that the global maximum of the function f under the event E_1 ∩ E_2 ∩ E_3 will lie either at x = 1/2 or x = 5/6. Since, by construction, we have f(1/2) = −f(5/6), a single evaluation of the function at either of these two points is sufficient to find the global maximum, and hence the regret R_n ≤ O(1).

A.2 Example 2
We observe that the covariance function of the Gaussian Process is upper bounded by ∑_{i=1}^{∞} a_i², which for our choice of parameters a_i will be finite. If we make the extra assumption that the noise variance is smaller than a_n², we get that γ_n ≥ n log(2), which implies that the information-type regret bound increases linearly with n. Now, for a fixed δ > 0, let us define the event E_4. By using the tail bounds for Gaussian random variables and the union bound, we get that Pr(E_4) ≥ 1 − δ. We now set the parameters as follows: the (a_i)_{i≥1} are chosen as above for all i ≥ 1, and σ = a_n/√2. Next, let (η_t)_{t≥1} denote the i.i.d. N(0, σ²) noise random variables. We define the event E_5, which also occurs with probability at least 1 − δ. Now, we need to show that there exists a strategy which will ensure, with high probability, that the cumulative regret is upper bounded by O(log(n)). Assuming that the events E_4 and E_5 hold (which happens with probability at least 1 − 2δ), we proceed as follows: • We first note that we can construct a ternary tree of intervals ({I_{j,k} : j ≥ 0, 1 ≤ k ≤ 3^j}) which forms an increasing sequence of partitions of the input space X = [0, 1].
• By the definition of the Gaussian Process, the function value in the interval $I_{1,1}$ is $a_1 X_1 \phi(3x) + f_2(3x)$ and in the interval $I_{1,3}$ it is $-a_1 X_1 \phi(3x-2) + f_2(3(x - 2/3))$; hence $x^*$ must lie either in $I_{1,1}$ or in $I_{1,3}$. To decide which one, we need to know the sign of $X_1$, for which we observe the function at the midpoint of the interval $I_{1,2}$. If the observed value is positive, we can conclude that $x^*$ must lie in $I_{1,1}$; otherwise, $x^*$ lies in $I_{1,3}$. Thus our region of uncertainty shrinks from $I_{0,1}$ to $I_{1,1}$ or $I_{1,3}$.
• For $t > 1$, we proceed similarly by evaluating the function at a point $x_t$ in the middle sub-interval of the current region of uncertainty. Based on the observed value, we can infer the sign of $a_t X_t$, which allows us to pick the next sub-interval. Thus at any time $t$, the suboptimality of the evaluated point is upper bounded as follows, where the second inequality follows from the definition of the event $E_4$ and the choice of $(a_i)_{i \ge 1}$:
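The sign-driven refinement in the steps above can be sketched as follows. This is a schematic, assuming each observation reliably reveals the sign of $a_t X_t$; the helper `refine` and the sign sequence are illustrative and not part of the paper's construction:

```python
def refine(signs, lo=0.0, hi=1.0):
    """Shrink the region of uncertainty on [lo, hi] via ternary splits.

    signs[t] is the inferred sign of a_t X_t, read off from the noisy
    observation at the midpoint of the middle sub-interval.
    """
    for s in signs:
        width = (hi - lo) / 3
        if s > 0:
            hi = lo + width      # maximizer lies in the first sub-interval
        else:
            lo = hi - width      # maximizer lies in the third sub-interval
    return lo, hi

# After t rounds the uncertainty region has width 3^{-t}.
lo, hi = refine([+1, -1, +1])
assert abs((hi - lo) - 3 ** -3) < 1e-12
```

One evaluation per level thus suffices, which is what drives the $O(\log n)$ cumulative regret for this example.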

B.2 Proof of Proposition 3
Proof. Let $\bar y_{1:t-1}$ denote all the observations before time $t$, let $\bar y_x$ be the vector of observations at $x$, and let $\bar y_{x^c}$ be the vector of observations at points other than $x$. Then, by the non-negativity of mutual information, we have the following, where $h(X)$ is the differential entropy of $X$ and $I(X;Y)$ denotes the mutual information between the random variables $X$ and $Y$: For inequality (a), we used the formula for the differential entropy of a Gaussian random variable. For the second part, let us define $S_x = \{x_1, x_2, \ldots, x_{n_t(x,r)}\}$ as the set of points in $B(x, r, l)$ which have been evaluated up to time $t$. Further, introducing the vector $K_x = [K(x, x_1), K(x, x_2), \ldots, K(x, x_{n_t(x,r)})]^{tr}$, where $tr$ denotes the transpose operation, and the matrix $K_{xx} = [K(x_i, x_j)]_{(x_i, x_j) \in S_x \times S_x}$, we have, by the formula for the posterior variance at $x$ (Rasmussen and Williams, 2006, (2.26)): Now, based on the assumption that $K$ is isotropic, we can make the following two observations, $(K_{xx} + \sigma^2 I) \preceq (K(0)\mathbf 1 \mathbf 1^{tr} + \sigma^2 I)$ and $K_x \succeq K(r)\mathbf 1$, which give us
$$-K_x^{tr}(K_{xx} + \sigma^2 I)^{-1} K_x \;\le\; -K(r)^2\, \mathbf 1^{tr} (K(0)\mathbf 1 \mathbf 1^{tr} + \sigma^2 I)^{-1} \mathbf 1.$$
Now, using the Woodbury matrix inversion identity and some simplification, we get: where (a) uses the inequality $\sqrt{z_1 + z_2} \le \sqrt{z_1} + \sqrt{z_2}$ for $z_1, z_2 \ge 0$ and (b) follows from the fact that $K \in \mathcal K$.
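As a numerical sanity check of the variance bound in this proof, the sketch below compares the exact GP posterior variance at $x$ with the bound $K(0) - K(r)^2\,\mathbf 1^{tr}(K(0)\mathbf 1\mathbf 1^{tr} + \sigma^2 I)^{-1}\mathbf 1$, which the Woodbury (Sherman–Morrison) identity collapses to $K(0) - m K(r)^2/(\sigma^2 + m K(0))$. The squared-exponential kernel, the two evaluation points, and all numeric values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def k(d, ls=1.0):
    # Isotropic squared-exponential kernel, decreasing in the distance d.
    return np.exp(-0.5 * (d / ls) ** 2)

sigma2, r = 0.1, 0.3
x = np.zeros(2)
pts = np.array([[0.2, 0.0], [-0.2, 0.0]])   # both inside the ball B(x, r)
m = len(pts)

D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
Kxx = k(D)                                   # Gram matrix of evaluated points
Kx = k(np.linalg.norm(pts - x, axis=1))      # cross-covariances with x
post_var = k(0.0) - Kx @ np.linalg.solve(Kxx + sigma2 * np.eye(m), Kx)

# Sherman-Morrison: 1^T (K(0) 11^T + sigma^2 I)^{-1} 1 = m / (sigma^2 + m K(0))
bound = k(0.0) - m * k(r) ** 2 / (sigma2 + m * k(0.0))
assert 0.0 <= post_var <= bound
```

Here the exact posterior variance is roughly 0.05 while the bound is roughly 0.13, consistent with the proof's claim that the bound is a (looser) upper envelope.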

C Proof of Theorem 1
For the entirety of this proof, we will assume that the events $\Omega_{u5}$ and $\Omega_{u6}$ hold, which is true with probability at least $1 - 2e^{-u}$. Let $(\tau_j)_{j=1}^{n}$ denote the rounds in which the function evaluations were performed, and let $Q_n = \{x_{h_{\tau_j}, i_{\tau_j}} \mid 1 \le j \le n\}$ denote the multiset of points evaluated by the algorithm.

C.1 Information-type bound on R n
To obtain the information-type cumulative regret bound, we divide the set $Q_n$ into $Q_{n1}$ and $Q_{n2}$, where $Q_{n1} = \{x_{h,i} \in Q_n \mid h < h_{\max}\}$ and $Q_{n2} = Q_n \setminus Q_{n1}$. From Lemma 1, we know that $\Delta(x_{h,i}) \le (2N+1)V_h$ for all $x_{h,i} \in Q_{n2}$, and assuming $n$ is large enough so that $h_{\max} \ge h_0 := \frac{\log(v_2/\delta_K)}{\log(1/\rho)}$, we can upper bound the contribution of the terms in $Q_{n2}$ to the cumulative regret (denoted by $R_{n2}$) as follows: where the last inequality relies on the assumption that $h_{\max} \ge h_0$ and the properties of the covariance functions in the class $\mathcal K$. Now, using the fact that $h_{\max} \ge \frac{\log n}{2\alpha \log(1/\rho)}$, we get that $R_{n2} \le O(\sqrt{n \log n})$.
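The last step can be made explicit; the following is a sketch, assuming (as the partition of $Q_n$ suggests) that every point in $Q_{n2}$ lies at depth $h_{\max}$, and using only the facts stated above that $\Delta(x_{h,i}) \le (2N+1)V_h$, that $V_h = O(\rho^{h\alpha}\sqrt{h})$, and that $h_{\max} = \Theta(\log n)$:

```latex
\begin{align*}
R_{n2} &\le \sum_{x_{h,i} \in Q_{n2}} (2N+1)\, V_{h_{\max}}
        \;\le\; n\,(2N+1)\; O\!\big(\rho^{h_{\max}\alpha}\sqrt{h_{\max}}\big) \\
       &\le O\!\big(n \cdot n^{-1/2} \sqrt{\log n}\big)
        \;=\; O\!\big(\sqrt{n \log n}\big),
\end{align*}
% since h_max >= log(n) / (2 alpha log(1/rho)) gives
% rho^{h_max * alpha} = e^{-h_max * alpha * log(1/rho)} <= n^{-1/2}.
```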
In inequality (c) above, we use the facts that for $h \ge h_0$ we have $|\mathcal X_{(2N+1)V_h} \cap \mathcal X_h| = O(\rho^{-h\bar D})$ (by the definition of $\bar D$), that $V_h = O(\rho^{h\alpha}\sqrt{h})$, and that $q_h = O(\beta_n^2 \rho^{-2h\alpha})$ by the assumptions on the covariance function.
Finally, the contribution of the remaining points in $Q_n$ can be trivially upper bounded as: Now, if we select $H = \frac{\log n}{\alpha(\bar D + 2\alpha)\log(1/\rho)} < h_{\max}$, we get $R_n \le O\big(\log(n)^{3/2}\, n^{1 - \frac{1}{\bar D + 2\alpha}}\big)$, as required.
To obtain the bound on the simple regret, we introduce the terms $Q_h = Q_n \cap \mathcal X_h$ for $h \ge 0$. For any $H > 0$, we have the following: Now, if we find the largest $H$ (denoted by $\bar H$) such that the upper bound on $\sum_{h=0}^{H} |Q_h|$ given above is smaller than $n$, then $\bar H$ will be a lower bound on the maximum depth explored by the algorithm. From the definition of $\bar H$, we can show that there exists some constant $C > 0$ such that
$$\left(\frac{C \log n}{n}\right)^{1/(\bar D + 2\alpha)} \;\le\; \rho^{\bar H} \;\le\; \frac{1}{\rho}\left(\frac{C \log n}{n}\right)^{1/(\bar D + 2\alpha)}.$$
Assuming that $n$ is large enough so that $\bar H \ge h_0$, and that $h_{\max} \ge \bar H$ (which is true if $h_{\max} \ge \frac{\log n}{2\alpha \log(1/\rho)}$), we can now upper bound the simple regret as follows:

To complete the description of the algorithm, we need to calculate the term $\beta_n$ and the terms $(W(r_k))_{k \ge 0}$ for the radii $r_k = \mathrm{diam}(\mathcal X)\, 2^{-k}$. We begin with the following simple claim, which gives us the appropriate choice of $\beta_n$.
Claim 5. For the choice of $\beta_n = O\big(\sqrt{(D_1/\alpha + 1)\log(n) + u}\big)$, we have, for any $u > 0$: where the event $\Omega_{u3}$ is defined as follows, with $t_n$ being the (random) number of rounds of the algorithm required for $n$ function evaluations:
Proof. Let $M_n = M(\mathcal X, n^{-1/(2\alpha)}, l)$ be the $n^{-1/(2\alpha)}$-packing number of $\mathcal X$ with respect to the metric $l$. Then, by the design of the algorithm, at any time $t$ we have $|A_t| \le M_n$, and also $t_n \le M_n + n$ almost surely. So, by two union bounds, we get
$$\Pr(\Omega_{u3}^c) \;\le\; \sum_{t=1}^{t_n} \sum_{x \in A_t} 2e^{-\beta_n^2/2} \;\le\; 2\, t_n M_n\, e^{-\beta_n^2/2} \;\le\; 2 M_n (M_n + n)\, e^{-\beta_n^2/2}.$$
Now, using the fact that $\mathcal X$ has a finite metric dimension $D_1$, we have $M_n \le C n^{D_1/(2\alpha)}$ for some constant $C > 0$. This implies that $M_n(M_n + n) \le C^2 n^{1 + D_1/\alpha}$ for $n \ge 1$.
Thus, for any $u > 0$, the choice $\beta_n = \sqrt{2\big(u + 2\log(C) + (D_1/\alpha + 1)\log(n)\big)}$ ensures that $\Pr(\Omega_{u3}) \ge 1 - e^{-u}$. Next, we obtain the terms $W(r_k)$, where $W(r_k)$ denotes a high-probability upper bound on the maximum variation of the GP sample within any ball of radius $r_k$ in $\mathcal X$.
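As a quick numerical illustration of the union-bound calculation in Claim 5, the sketch below plugs the stated choice of $\beta_n$ back into the failure-probability bound $2M_n(M_n+n)e^{-\beta_n^2/2}$; the constants $C$, $D_1$, $\alpha$ and the horizon $n$ are arbitrary illustrative values:

```python
import math

def beta_n(n, u, C, D1, alpha):
    # beta_n = sqrt(2 (u + 2 log C + (D1/alpha + 1) log n)), as in Claim 5
    return math.sqrt(2 * (u + 2 * math.log(C) + (D1 / alpha + 1) * math.log(n)))

n, u, C, D1, alpha = 1000, 3.0, 1.5, 2.0, 1.0
b = beta_n(n, u, C, D1, alpha)
Mn = C * n ** (D1 / (2 * alpha))     # packing-number bound M_n <= C n^{D1 / (2 alpha)}
fail = 2 * Mn * (Mn + n) * math.exp(-b ** 2 / 2)

# The union bound keeps the failure probability at the e^{-u} scale.
assert fail <= 2 * math.exp(-u)
```

With these values the computed failure probability is on the order of $10^{-4}$, far below $e^{-u} \approx 0.05$, since the $n$-dependent factors cancel exactly against $e^{-\beta_n^2/2}$.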
Proof. The result follows immediately by applying Proposition 2 with radius $r_k$ and $R_k = 2r_k$.
Without loss of generality, we can assume that the diameter of the search space $\mathcal X$ is 1. Then, in the expression for $W(r_k)$ above, we can upper bound the term $|\mathcal X_k|$ for all $k$ by $C 2^{k D_1}$, due to the assumption of finite metric dimension of $\mathcal X$. Thus, for all $k \ge 0$, we have $W(r_k) \le O\big(g(r_k)\sqrt{u + 2 D_1 k \log(C_2/\mathrm{diam}(\mathcal X))}\big)$, and in particular, for $k \ge \log_2(1/\delta_0)$, we have $W(r_k) \le O\big(2^{-k\alpha}\sqrt{u + 2 D_1 k \log(C_2/\mathrm{diam}(\mathcal X))}\big)$.

Having described the algorithm parameters, we now present an outline of the derivation of the regret bounds for the Bayesian Zooming algorithm; the proof of the regret bounds can be completed in a manner analogous to the proof of Theorem 1. We characterize the properties of the points selected by the algorithm in the following lemma.

Lemma 2. Under the events $\Omega_{u3}$ and $\Omega_{u4}$, the following statements are true:
• Any point $x$ at which the function is evaluated by the algorithm satisfies:
• If in round $t$ the function value is evaluated at a point $x$ with $r(x) > r_{\min}$, then we have:
• Any two points $x_1$ and $x_2$ which have each been evaluated at least $k$ times must satisfy $l(x_1, x_2) > r_k$.
• A point $x$ with radius $r_k$ will be evaluated no more than $q_{r_k}$ times before its radius is shrunk, where $q_{r_k}$ is defined as:

Proof. The results stated above follow directly from the point selection and refinement strategy used in the algorithm.