Filtered Poisson Process Bandit on a Continuum

We consider a version of the continuum-armed bandit in which an action induces a filtered realisation of a non-homogeneous Poisson process. Point data in the filtered sample are then revealed to the decision-maker, whose reward is the total number of revealed points. Using knowledge of the function governing the filtering, but without knowledge of the Poisson intensity function, the decision-maker seeks to maximise the expected number of revealed points over T rounds. We propose an upper confidence bound algorithm for this problem utilising data-adaptive discretisation of the action space. This approach enjoys O(T^(2/3)) regret under a Lipschitz assumption on the reward function. We provide lower bounds on the regret of any algorithm for the problem, via new lower bounds for related finite-armed bandits, and show that the orders of the upper and lower bounds match up to a logarithmic factor.


Introduction
The challenge of detecting interesting events, using limited resources, arises in numerous settings. In a defence context, surveillance teams wish to observe suspicious activity or gain intelligence. In ecological and environmental data collection, scientists wish to observe behaviours of endangered species or record notable measurements of environmental variables. In manufacturing and logistics settings, it is desirable to observe faults in machine operation or a supply chain.
However, in all of these settings, practitioners may face the problem of having insufficient resource to observe everything they wish to, and must optimise their resource allocation to maximise the detection of events. In these settings "resource" may refer to human searchers, fixed or mobile sensors, cameras, or a variety of other equipment with a capacity to observe events of interest.
Two factors play a particularly important role in the rate of detection. Crudely put, these are where we look, and how good we are at looking. In any of these settings, we can only expect to observe events in locations (spatial or temporal) where we deploy resource. Further, the precision of the detection may also be affected by how resource is deployed. If resource is spread over a large region, the probability of detecting events within this region may be lower than if focused on a small area.
Inspired by these challenges, we consider a stylised model of resource allocation which captures the challenge of balancing coverage and detection probability. This framework is sufficiently abstract to model problems across the various aforementioned applications and beyond.
Consider a decision-maker who aims to detect the maximum number of events occurring according to a non-homogeneous Poisson process (NHPP) on a segment [0, 1]. The decision-maker selects a point y ∈ [0, 1] and then sweeps the sub-segment [0, y] searching for events. However, the decision-maker's search is imperfect, in that events in [0, y] are detected, independently of each other, with filtering probability γ(y), where γ : [0, 1] → [0, 1] is a known, nonincreasing function. The expected number of events detected by the decision-maker on a single sweep is then determined by the filtering probability and the cumulative intensity function (CIF) of the NHPP, Λ(y) = ∫_0^y λ(u) du, where λ : [0, 1] → R is the rate function of the NHPP. Given the decision-maker chooses to sweep [0, y], the number of events detected has a Poisson(Λ(y)γ(y)) distribution. Figure 1 illustrates this process. An example intensity function λ is represented by the blue curve and a function γ giving the filtering probability is given by the black curve. The blue points towards the bottom of the left pane illustrate a single sample of events from the NHPP with intensity λ. The decision-maker selects y = 0.6 and sweeps the sub-segment [0, 0.6], detecting each event therein with probability γ(0.6). The red piecewise-constant function in the right pane illustrates the effective filtering probability over [0, 1]. The points plotted in red then represent the events actually detected by the decision-maker during their imperfect search, which we observe are a subset of the events that actually arose.
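As a concrete illustration of this observation model, the following sketch simulates a single sweep, sampling the NHPP by thinning; the particular λ and γ used here are illustrative stand-ins, not assumptions of the model:

```python
import math
import random

# Simulate one sweep of the filtered NHPP observation model: events on [0, 1]
# are drawn from the NHPP by thinning a homogeneous Poisson(lam_max) process,
# and each event in [0, y] is revealed independently with probability gamma(y).
# The intensity lam and filter gamma below are illustrative choices.

def sweep(y, lam, lam_max, gamma, rng):
    # NHPP sample on [0, 1] via thinning
    events, s = [], 0.0
    while True:
        s += rng.expovariate(lam_max)
        if s > 1.0:
            break
        if rng.random() < lam(s) / lam_max:
            events.append(s)
    # Imperfect detection: events left of y revealed with probability gamma(y)
    detected = [z for z in events if z <= y and rng.random() < gamma(y)]
    return events, detected

rng = random.Random(0)
lam = lambda x: 20 - 20 * x          # intensity, bounded by lam_max = 20
gamma = lambda yy: math.exp(-yy)     # nonincreasing filtering probability
events, detected = sweep(0.6, lam, 20.0, gamma, rng)
```

The expected number of detections at endpoint y is Λ(y)γ(y), which can be checked empirically by averaging over many sweeps.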
In this paper, we consider a sequential variant of this problem, where the CIF, Λ, is unknown to the decision-maker, but the choice of endpoint y can be updated over a series of rounds, in response to observing the locations of detected events in previous rounds. The decision-maker's aim is then to maximise the expected number of detected events over T ∈ N rounds. The study of this problem is motivated both by its theoretical challenge and its practical interest.
Versions of this problem may arise in a number of settings such as ecological surveillance, defence, and logistics, where sightings of endangered species, criminal activity, or machine faults may for instance comprise the events of interest. As a motivating, and sufficiently general, example, consider a scenario where observations are made by searchers (representing cameras, sensors, robotic and human searchers, etc.) that must restart at the same point after each round. We note that while in the material that follows we will treat the line segment as indexing space (for clarity and consistency), it could equivalently be thought of as indexing time or space-time, and apply to a yet broader range of examples. From a theoretical perspective, the problem is closely related to the one-dimensional case of the stochastic continuum-armed bandit (CAB) problem (Agrawal 1995). This is a sequential decision-making problem where in each of a series of rounds t ∈ [T] ≡ {1, . . . , T}, a decision-maker selects an action x_t ∈ [0, 1] and receives a reward, which is a noisy realisation of some unknown smooth function f : [0, 1] → [0, 1] evaluated at x_t. The decision-maker's aim is to maximise the expected sum of rewards amassed over T rounds. To realise this aim, the decision-maker must deploy a strategy which appropriately balances between exploring the action space [0, 1] to learn the function f, and exploiting this information, selecting actions known to produce larger rewards to maximise the cumulative total.
In the Poisson process-based problem at hand, a similar dilemma arises: we lack knowledge of the filtered CIF, which corresponds to the reward function, and can only hope to maximise the sum of rewards by exploring the action space, i.e. choosing a range of endpoints y ∈ [0, 1]. However, the feedback received on actions in our problem is much richer than in the standard CAB problem. In addition to a noisy realisation of the filtered CIF, Λγ, we observe the locations of detected events, which can help with the estimation of the reward function beyond the inferences from smoothness properties alone. Methods for the standard CAB problem are therefore inappropriate for the problem we face, as is the existing unmodified theory. In this paper we present a specific treatment of the previously described sequential endpoint selection problem, which we henceforth refer to as a Filtered Poisson Process Bandit (FPPB), deriving a bespoke decision-making algorithm and theoretical analysis of the problem.

Related Literature
Sequential decision-making problems on continuous action spaces have been studied extensively, following from the initial works of Agrawal (1995) and Kleinberg (2005). Most successful strategies have employed a combination of adaptive discretisation of the action space and optimism in the face of uncertainty. Our approach for the FPPB problem also uses these techniques.
Adaptive discretisation, as used in the "Zooming" algorithm of Kleinberg et al. (2008) and the "hierarchical optimistic optimisation" (HOO) algorithm of Bubeck et al. (2011a), reduces the available action space in round t to some A_t ⊂ [0, 1]. Restricting the action set ensures exploration occurs at a predictable rate, and makes the action selection more straightforward. Gradually, as the rounds proceed and more information is gathered, A_t is enlarged, usually in a data-adaptive fashion, to permit choice from a more granular set of actions. Intuitively, this is also appealing: when estimates of the reward are very crude, there is little motivation to make decisions at a very granular level.
Optimistic approaches are those which encourage an appropriate balance of exploration and exploitation by making decisions with respect to high probability upper confidence bounds (UCBs) on the expected reward of the available actions. The Zooming and HOO algorithms both calculate UCBs for the reward of available actions in each round and select the action with the largest UCB. These approaches were the first to achieve order optimal performance, in terms of regret, for this class of problems.
Strong results have also been obtained by approaches which use Gaussian processes and avoid discretisation of the action space. The GP-UCB (Gaussian Process Upper Confidence Bound) algorithm of Srinivas et al. (2010) constructs an upper confidence bound on the reward function over all actions, rather than at specific points, and selects the action which maximises this UCB function. This method also has order-optimal performance guarantees, but with respect to a Bayesian measure of regret, rather than the frequentist one used in the analysis of the Zooming and HOO algorithms.
It is worth noting that none of these algorithms can sensibly be applied to the FPPB, and that their theoretical guarantees do not carry over to the FPPB problem. Principally, this is because they lack a means to handle the additional feedback in terms of the location data, but a more subtle point is that, without modification, these methods are not suited to unbounded rewards, as we have in this setting with the Poisson distributed reward. Grant et al. (2020) consider a filtered Poisson bandit problem which is similar in some senses to ours, but theirs employs a fixed discretisation of the action space, such that the spatial locations of the events are irrelevant. They focus instead on the challenges of choosing multiple non-overlapping sub-segments, and analyse performance with respect to the best possible action among a fixed discrete set. Grant et al. (2019) consider a continuous action space, but without filtering of the observations. Inference is therefore more straightforward in that setting, and the Thompson Sampling method proposed is not applicable to the FPPB setting. Recently, Lu et al. (2019) provide an algorithm combining the adaptive discretisation of Kleinberg et al. (2008) and the heavy-tailed UCBs of Bubeck et al. (2013) for a version of the CAB problem with heavy-tailed reward noise distributions. While the Poisson does fit into this class of distributions, it also enjoys tighter bespoke concentration results, and a general heavy-tailed approach is overly conservative for the FPPB, even if event locations were not observed.

Key Contributions and Structure
The main contribution is a UCB algorithm with Õ(T^{2/3}) regret over T rounds. By deriving a lower bound, we show that, under the assumptions on the CIF, this is optimal up to a logarithmic factor. From a methodological viewpoint, we extend the Lipschitz multi-armed bandit framework (Kleinberg et al. 2008) to deal with a filtered Poisson process on a continuum.
The remainder of the paper is structured as follows. In Section 2 we precisely state the problem of interest. In Section 3 we present our UCB approach to the problem. Sections 4 and 5 provide the upper and lower bounds on regret respectively. We conclude with a simulation of our method in Section 6, and discussion in Section 7.

Model
The formal specification of the FPPB problem is as follows. In rounds t ∈ [T], the decision-maker selects an endpoint y_t ∈ [0, 1] and makes an observation on the sub-segment [0, y_t]. The environment generates a realisation of the NHPP with CIF Λ, consisting of an increasing sequence of event locations {X_{t,1}, X_{t,2}, . . . , X_{t,N_t}} ∈ [0, 1]^{N_t}, where N_t ∼ Poisson(Λ(1)). The endpoint selected by the decision-maker implies a filtering probability γ(y_t) ∈ [0, 1], such that events to the left of y_t are detected independently of each other with probability γ(y_t), and all events to the right of y_t are not detected. As a result, a sequence of i.i.d. Bernoulli(γ(y_t)) random variables, B_{t,1}, B_{t,2}, . . . , B_{t,N_t}, is generated. The decision-maker receives the count of detected events

R_t ≡ R_t(y_t) = Σ_{k=1}^{N_t} 1(B_{t,k} = 1, X_{t,k} ≤ y_t)

as a reward, and observes the locations of the detected events. The decision-maker's objective is to maximise the sum of rewards obtained over T rounds, Σ_{t=1}^T R_t. To realise this objective we aim to determine a policy, A, which maps from a history of actions and observations to a next action, and which maximises the expected reward, or equivalently minimises the regret

Reg(T) = T Λ(z*)γ(z*) − E(Σ_{t=1}^T R_t),

where z* ∈ argmax_{y∈[0,1]} Λ(y)γ(y) is an optimal endpoint which maximises the expected per-round reward. Here the expectation is with respect to both the random process governing the generation and filtering of events and the decision-maker's actions. We will be interested in upper bounding the regret as a function of T for our proposed algorithm, and comparing the order of this upper bound to that of lower bounds on the best achievable regret of any algorithm. Bounded regret is achievable only if the reward function is suitably well-behaved as to admit learning from a finite sample of observations. This is ensured through assumptions on the form of the CIF and filtering function.
These assumptions, enforced throughout the paper, are Lipschitz continuity of the filtered CIF,

(A1) |Λ(y)γ(y) − Λ(y′)γ(y′)| ≤ m|y − y′| for all y, y′ ∈ [0, 1],

and a bound on the rate function,

(A2) λ(x) ≤ λ_max for all x ∈ [0, 1],

for m, λ_max ≥ 0 known and finite. Assumptions A1-A2 are used to bound the estimation error for the expected number of detected events in each cell; this can be achieved by including in the cell index an additive term proportional to the cell length. We also assume that γ_min = inf{γ(y) : y ∈ (0, 1], γ(y) > 0} > 0; this is without loss of generality, as segments with γ(·) = 0 do not contain the optimal endpoint.
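As a quick sanity check of A1 for a concrete pair of functions (the λ(x) = 20 − 20x and γ(x) = e^{−x} used later in the experiments of Section 6), the steepest finite-difference slope of the filtered CIF should not exceed m = 20:

```python
import numpy as np

# Numerical check of the Lipschitz assumption (A1) for an example pair:
# lambda(x) = 20 - 20x, so Lambda(x) = 20x - 10x^2, and gamma(x) = exp(-x).
# The steepest slope of the filtered CIF Lambda(x)*gamma(x) should be <= 20.
x = np.linspace(0.0, 1.0, 10001)
fcif = (20 * x - 10 * x ** 2) * np.exp(-x)
slopes = np.abs(np.diff(fcif) / np.diff(x))
m_hat = float(slopes.max())   # empirical Lipschitz constant, close to 20 at x = 0
```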

Algorithm
In this section we present our algorithm for the FPPB problem, CIF-UCB, given as Algorithm 1.

At a high level, CIF-UCB proceeds as follows. For each round t = 1, . . . , T, the algorithm maintains a set of active cells, A_t, which form a partition of [0, 1]. An index, I_t, taking the form of an optimistic estimate of the expected reward, is computed for each cell in A_t. The algorithm selects the right endpoint of the active cell with the largest index as the action for that round, and does a sweep up to that endpoint. Initially, the active set contains the unit interval, A_1 = {(0, 1]}, so that the algorithm does a complete sweep in the first round. If the number of sweeps of a cell exceeds some threshold in relation to its length, the cell is split in half. Hence, active cells make up a partition of the interval [0, 1] in all rounds. A new cell inherits the number of sweeps and detection count that fall in its interval from the parent cell.
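The loop described above can be sketched as follows. This is a schematic rendering, not the paper's exact Algorithm 1: the splitting threshold and the confidence width ζ below are illustrative placeholders, and env_sweep(b) stands for the environment returning the detected event locations in [0, b].

```python
import math
import random

# Schematic sketch of the CIF-UCB loop. Each active cell stores the sweeps
# made while it was active: pairs (gamma(b), detected locations) for rounds
# with endpoint b >= its right endpoint y. Splitting threshold and confidence
# width are illustrative placeholders, not the constants of Algorithm 1.

def cif_ucb(env_sweep, gamma, m, T):
    cells = [{"x": 0.0, "y": 1.0, "sweeps": []}]   # active partition of (0, 1]
    rewards = []
    for t in range(1, T + 1):
        def index(c):
            eff = sum(g for g, _ in c["sweeps"])   # effective number of sweeps
            if eff == 0:
                return float("inf")                # force an initial sweep
            z = sum(sum(1 for loc in locs if loc <= c["y"])
                    for _, locs in c["sweeps"])
            lam_hat = z / eff                      # unfiltered CIF estimate at y
            zeta = math.sqrt(2 * math.log(T) / eff)  # schematic confidence width
            return gamma(c["y"]) * (lam_hat + zeta) + m * (c["y"] - c["x"])
        best = max(cells, key=index)
        b = best["y"]
        detected = env_sweep(b)
        rewards.append(len(detected))
        for c in cells:
            if c["y"] <= b:                        # cells left of the endpoint
                c["sweeps"].append((gamma(b), list(detected)))
        # Division rule (schematic): split once sweeps outgrow the cell length;
        # children inherit the parent's sweep history.
        eff_best = sum(g for g, _ in best["sweeps"])
        if eff_best >= 1.0 / (best["y"] - best["x"]):
            mid = 0.5 * (best["x"] + best["y"])
            cells.remove(best)
            cells.append({"x": best["x"], "y": mid, "sweeps": list(best["sweeps"])})
            cells.append({"x": mid, "y": best["y"], "sweeps": list(best["sweeps"])})
    return rewards, cells

rng = random.Random(7)
def env_sweep(b):
    # Crude stand-in environment: 10 uniform events, filtered at rate exp(-b)
    pts = [rng.random() for _ in range(10)]
    return sorted(z for z in pts if z <= b and rng.random() < math.exp(-b))

rewards, cells = cif_ucb(env_sweep, lambda yy: math.exp(-yy), 20.0, 300)
```

Note that the active cells remain a partition of (0, 1] throughout, since splits replace a cell with its two halves.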
Accumulating rewards over the interval to the left of the selected endpoint makes the problem structure combinatorial in nature, which poses a challenge for the analysis. The insight that makes the analysis tractable is that, by the independent increment property of the Poisson process, the filtered Poisson counts corresponding to the active cells that lie to the left of the endpoint selected by the algorithm in each round are independent. This leads to a CIF estimator for each active cell with tight error bounds.
We complete the notation needed to define the CIF estimator. Let {F_t}_{t=1}^T be the filtration induced by the sequence of event locations and cell selections. Let V_t(x, y) = {τ_1, τ_2, . . .} be the collection of (random) times when active cell (x, y] is swept by round t, and let Z_{τ_i}(y) be the filtered Poisson count to the left of y in round τ_i. When the context is clear, we write V in lieu of V_t(x, y). For active cell (x, y], Λ(y) is estimated by dividing the cumulative filtered Poisson counts up to y by the cell's effective number of sweeps by round t, Σ_{i=1}^{|V_t(x,y)|} γ(b_{τ_i}), where b_{τ_i} is the endpoint selected in round τ_i:

Λ̂_t(y) = ( Σ_{i=1}^{|V_t(x,y)|} Z_{τ_i}(y) ) / ( Σ_{i=1}^{|V_t(x,y)|} γ(b_{τ_i}) ).    (2)

Essentially, in (2) the filtered Poisson count is unfiltered by dividing it by the effective number of sweeps. Since the cell (x, y] is swept in round τ_i only if b_{τ_i} ≥ y, each event in [0, y] is detected with probability γ(b_{τ_i}) in that round, and it is easy to see that Λ̂_t(y) is an unbiased estimator of Λ(y).
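As a numerical illustration of the unbiasedness of (2), the following sketch pools filtered counts over sweeps with varying endpoints; the Λ, γ and endpoint sequence are arbitrary choices for the illustration:

```python
import math
import random

# Monte Carlo check that the estimator (2) recovers Lambda(y): filtered counts
# to the left of y, pooled over sweeps with endpoints b >= y, divided by the
# effective number of sweeps sum_i gamma(b_i). Lambda, gamma and the endpoint
# sequence are arbitrary illustrative choices.

def poisson(mu, rng):
    # Knuth's method; adequate for small means
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(1)
Lam = lambda x: 20 * x - 10 * x ** 2    # CIF of the intensity 20 - 20x
gamma = lambda x: math.exp(-x)          # filtering probability

y = 0.4
endpoints = [0.4, 0.6, 0.9] * 4000      # rounds in which the cell ending at y is swept
counts, eff = 0, 0.0
for b in endpoints:
    counts += poisson(Lam(y) * gamma(b), rng)   # filtered count in [0, y]
    eff += gamma(b)
lam_hat = counts / eff                  # estimator (2)
```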
CIF-UCB samples from the origin to the endpoint of the active cell with the largest index, and divides that cell once it has been swept sufficiently often relative to its length. The complexity of CIF-UCB is O(T) for the variable updates.

Upper Bound on Regret
In this section we present the first of our main theoretical contributions, an upper bound on the regret of CIF-UCB.
Theorem 1. The regret of CIF-UCB applied to the FPPB problem, with CIF and filtering function satisfying Assumptions A1 and A2, satisfies Reg(T) = Õ(T^{2/3}).

Proof. The proof has three main stages. We first bound the CIF estimator error for each active cell (Lemma 1), and then use the Lipschitz assumption to extend the bound to include all the points inside an active cell (knowing that one of these points is an optimal endpoint for some active cell; Corollary 1). Second, we use the Division rule to express the confidence bound of each active cell in terms of its length (Lemma 2), which yields a bound for the per-round regret of the cell selected by the algorithm (Lemma 3). Finally, we accumulate these per-round regrets to obtain an upper bound for the regret over T rounds.
Firstly, we present the following concentration result, which asserts that the difference between the true CIF and the estimated CIF is unlikely to exceed the upper confidence terms used in Algorithm 1.
Lemma 1. Let (x, y] be an active cell in round t. Then |Λ̂_t(y) − Λ(y)| ≤ ζ_t(y) with probability at least 1 − 2T^{−2}.

Proof. The Poisson count Z_{τ_i}(y) is F_{τ_i}-measurable, and its deviations are controlled by a Chernoff bound. Setting the right-hand side of the Chernoff bound equal to T^{−3} and solving gives the form of ζ_t(y). It follows that the probability that the deviation exceeds ζ_t(y) is at most T^{−3} for each k ≤ T. Taking a union bound over all k ≤ T, and substituting the definitions of Λ̂_t(y) and ζ_t(y), results in P(Λ(y) > Λ̂_t(y) + ζ_t(y)) ≤ T^{−2}. Finally, using the same approach it can be shown that P(Λ(y) < Λ̂_t(y) − ζ_t(y)) ≤ T^{−2}, so the proof is complete.
The Lipschitz assumption can be used to extend this to a high probability bound on the filtered CIF for active cells.
Proof. By the Lipschitz assumption, the value of the filtered CIF at any point of an active cell differs from its value at the right endpoint y by at most m(y − x), and the claim follows from Lemma 1.

The index of a cell (x, y] active in round t is

I_t(x, y) = γ(y)Λ̂_t(y) + m(y − x) + γ(y)ζ_t(y).

The γ(y)Λ̂_t(y) part of the index induces exploitation, while the m(y − x) + γ(y)ζ_t(y) term promotes exploration.
All the results that follow in this section hold on the sample paths where the confidence bound (3) of Corollary 1 holds for all rounds t = 1, . . . , T. By Corollary 1, the contribution to the regret of the sample paths that violate (3) is of order O(1), after accounting for the T rounds and up to T cells by round T. Our next result bounds the upper confidence term ζ_t for an active cell on the high-probability event of Corollary 1.
Proof. Let V^{(p)}(x, y) be the set of rounds in which the parent cell of (x, y] was swept. The Division rule for the parent cell implies a bound on the parent's number of sweeps.

From this we obtain the conservative lower bound (4) on the effective number of sweeps of (x, y]. Next we upper bound ζ_t(y): the first inequality follows from the definition of ζ_t(y), and the second inequality follows from the lower bound (4).
Let z* be an optimal endpoint (i.e., γ(z*)Λ(z*) ≥ γ(y)Λ(y) for all y ∈ [0, 1]), and let (u_t, v_t] ∈ A_t be the cell that contains z* in round t. The next result bounds the regret incurred in each round in terms of the length of the cell selected by the algorithm.
Proof. We will show a chain of inequalities from which the claim follows.

For the first inequality, we observe a sequence of bounds which follow, in order, from the Selection rule, the definition of the index function I_t together with Corollary 1, the fact that z* ∈ (u_t, v_t], and the Lipschitz assumption. In the other direction, we apply Corollary 1, and then Lemma 2. The final stage of the proof combines these results to realise the bound on regret. By Lemma 3, the regret of cells with length at most ℓ is bounded by T(8m²ℓ² max{1, 1/λ_max} + 5mℓ) over all rounds. Cells with final length ℓ have three properties: (i) there are at most 1/ℓ such cells; (ii) their regret per round is at most 8m²ℓ² max{1, 1/λ_max} + 5mℓ (Lemma 3); and (iii) they satisfy the Division rule.
Using Eqs. (5) and (6) with ℓ = 2^{−k} results in the bound (7), valid for all integers k ≥ 0. The value of k that minimises regret equalises the leading growth rates of both summands in (7), meaning that 2^k = T^{1/3}. The claim follows from here.
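The balancing step can be made explicit with a schematic version of (7), using illustrative constants c₁, c₂ in place of the exact ones:

```latex
% Schematic form of the bound (7): a discretisation term that shrinks with the
% cell length 2^{-k}, and an exploration term that grows as cells get finer.
% The constants c_1, c_2 are illustrative, not those of the paper.
\mathrm{Reg}(T) \;\lesssim\; c_1 \, T \, 2^{-k} \;+\; c_2 \, 2^{2k}.
% Equalising the growth rates of the two summands,
%   T 2^{-k} \asymp 2^{2k} \iff 2^{3k} \asymp T \iff 2^{k} \asymp T^{1/3},
% gives \mathrm{Reg}(T) \lesssim (c_1 + c_2) T^{2/3}, the rate of Theorem 1.
```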

Lower Bound on Regret
In this section we give a lower bound on the regret obtained by any algorithm for the filtered Poisson process bandit. The result is given below as Theorem 2, and we see that, subject to further minor conditions on the filtering function, the order of the lower bound on regret matches that of the upper bound on the regret of CIF-UCB up to a logarithmic factor. In this sense, CIF-UCB is asymptotically order optimal up to logarithmic factors.
Theorem 2. For the filtered Poisson process bandit problem on [0, 1] as described in Section 2, with filtering function γ satisfying condition (8) for any 0 ≤ a ≤ b ≤ 1, there exists a valid CIF such that the regret of any algorithm is bounded below as Reg(T) = Ω(T^{2/3}).
The proof of this lower bound is based on an established analytical technique of relating the regret of an algorithm for a continuum armed bandit problem to that of an algorithm for an associated finite-armed bandit problem. A lower bound on regret for the finite-armed problem is then utilised to lower bound the regret of the continuum armed bandit algorithm.
Here, such an associated finite-armed bandit problem must share the filtering structure of the FPPB to relate regret across the problems, and as such we require a bespoke finite-armed problem. Therefore, before giving the proof of Theorem 2, we introduce a filtered Poisson multi-armed bandit (FPMAB) problem which can be viewed as a discretised version of the FPPB. We derive a lower bound on the regret of any algorithm for the FPMAB, which is a key component of the proof of Theorem 2.
The problem takes place over a series of rounds t ∈ [T], in each of which the decision-maker selects an arm a_t ∈ [K] and receives a stochastic reward R_t = R(a_t). In addition, the decision-maker observes filtered observations, R̃_{k,t} for 1 ≤ k ≤ a_t. These observations are distributed as R̃_{k,t} ∼ Poisson(γ_{a_t}(Λ_k − Λ_{k−1})), with Λ_0 = 0. The reward is defined as the sum of the filtered observations, R_t = Σ_{k=1}^{a_t} R̃_{k,t}, and therefore follows a Poisson distribution with parameter µ_{a_t} = γ_{a_t}Λ_{a_t}, by the superposition property of the Poisson distribution.
As in the FPPB, the decision-maker's aim is to minimise the regret in T rounds, defined as

Reg(T) = Tµ_{a*} − E(Σ_{t=1}^T R_t),

where a* ∈ argmax_{k∈[K]} µ_k is an optimal arm. We have the following minimax lower bound on the regret of any algorithm for the FPMAB problem.
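To make the feedback structure concrete, one round of the FPMAB can be simulated as follows; the Λ and γ values are arbitrary illustrations, not the lower-bound construction used in the proofs:

```python
import math
import random

# One round of FPMAB feedback: pulling arm a reveals a filtered Poisson
# observation for every k <= a, and the reward is their sum, which is
# Poisson(gamma_a * Lambda_a) by superposition. The parameter values are
# arbitrary illustrations, not the paper's lower-bound construction.

def poisson(mu, rng):
    # Knuth's method; adequate for small means
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def pull(a, Lam, gam, rng):
    # Lam: increasing CIF parameters (0-indexed); gam: filtering probabilities
    obs = []
    for k in range(a + 1):
        increment = Lam[k] - (Lam[k - 1] if k > 0 else 0.0)
        obs.append(poisson(gam[a] * increment, rng))
    return sum(obs), obs   # reward and per-arm filtered observations

rng = random.Random(0)
Lam = [1.0, 2.5, 3.0]      # increasing CIF parameters
gam = [0.9, 0.6, 0.5]      # nonincreasing filtering probabilities
reward, obs = pull(1, Lam, gam, rng)
```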
Theorem 3. For any number of arms K ≥ 2, horizon T ∈ N, and set of filtering parameters γ_1, . . . , γ_K satisfying

γ_k ≥ (1 + ε)γ_{k+1}    (9)

for k ∈ [K − 1] and some small ε > 0, there exist a set of CIF parameters Λ_1, . . . , Λ_K and a known constant C > 0 such that the regret of any algorithm for the FPMAB problem is at least CεT for sufficiently large T.

This theorem is similar in spirit to the lower bound on regret for stochastic multi-armed bandits with bounded rewards in Theorem 5.1 of Auer et al. (2002), and its generalisation in Bubeck et al. (2011b). Indeed, Theorem 3 has the same order with respect to ε and T; however, there are key differences in the proof of the result. Firstly, Theorem 3 considers filtered Poisson random variables, and therefore parts of the analysis are specific to the KL divergence between two Poisson random variables, rather than the Bernoulli random variables of the bounded case. Secondly, here we deal with the additional combinatorial feedback of the FPMAB problem, and require further analysis to handle the resulting complexities.
In the remainder of this section we prove Theorems 2 and 3.

Proof of Theorem 2
Proof. Consider the instance of the filtered Poisson process bandit problem referred to as I(x*, ε), for x* ∈ [0, 1] and ε > 0, specified by a reward function ν_{x*,ε} which attains its peak value mε(1 + ε) at x* and equals mε outside [x*, x* + ε). Such a reward function is realised by setting the CIF to Λ_{x*,ε} = ν_{x*,ε}/γ. To verify that this CIF is increasing, consider its derivative. We note that (γ(x))^{−1} > 1 for all x ∈ [0, 1] since γ : [0, 1] → [0, 1], and that d(γ(x))^{−1}/dx ≥ 0 for all x ∈ [0, 1] since γ is nonincreasing on [0, 1]. In the limit as b − a → 0, condition (8) implies that −dγ(x)/dx ≥ γ(x). We have, for a differentiable function f such that f(x) ≠ 0, that the derivative of g(x) = 1/f(x) is g′(x) = −f′(x)/(f(x))², and it follows from (13) that dΛ_{x*,ε}(x)/dx > 0 for x ∈ [x*, x* + ε). For all other values of x ∈ [0, 1] the derivative of the CIF is positive, since it comprises a sum of non-negative terms. As such, Λ_{x*,ε} satisfies the necessary increasing assumption, and the instance I(x*, ε) is a valid instance of the FPPB.
We will lower bound the regret of any algorithm for the problem instance I(x * , ) by relating it to an instance of the filtered Poisson MAB problem.
We fix K ∈ N, to be chosen later, and let ε = (2K)^{−1}. Further, we introduce a function f : [K] → [0, 1], used to map between actions in the MAB problem and the CAB problem. We then define an instance J(a*, ε) of the K-armed filtered Poisson MAB problem as that with arm means µ_a = ν_{x*,ε}(f(a)), a ∈ [K], and filtering probabilities γ_a = γ(f(a)). It follows that in the problem instance J(a*, ε) there is a single optimal arm, the arm a* ∈ [K] with x* ∈ [(a* − 1)/K, a*/K], with expected reward µ_{a*} = mε(1 + ε), and all other arms a ≠ a* have expected reward µ_a = mε.
Let ALG be any algorithm for the CAB problem I(x*, ε). We will define ALG′ as an associated algorithm for the MAB problem J(a*, ε). These algorithms are related as follows. When ALG selects an action x_t ∈ [0, 1], ALG′ selects the arm a_t ≡ a(x_t) ∈ [K] whose sub-interval contains x_t. By definition of the FPMAB, ALG′ will receive reward R′(a_t) ∼ Pois(µ_{a_t}) and per-arm observations R̃_{i,t} ∼ Pois(γ_{a_t}(Λ_i − Λ_{i−1})) for i ≤ a_t. Similarly, ALG will receive reward R(x_t) ∼ Pois(ν_{x*,ε}(x_t)) and observe point data in [0, x_t] derived from the filtered Poisson process. We shall also, however, demonstrate that R(x_t) can be shown to have the same distribution as a certain probabilistic function of R′(a_t), and use this representation to relate the regret of ALG and ALG′.
Define Z to be a Poisson random variable with parameter mε(1 + ε), and Y to be a Poisson random variable with parameter mε. Then define r_x, a random variable whose distribution depends on x ∈ [0, 1], as a suitable probabilistic mixture of Z and Y. We notice that for both I(x*, ε) and J(a*, ε) the reward of the optimal action is mε(1 + ε). Further, we have that E(R(x_t)) ≤ E(R′(a(x_t))) for all x_t ∈ [0, 1]. It therefore follows that the regret of ALG′ serves as a lower bound on the regret of ALG, i.e. we have E(Reg_ALG(T)) ≥ E(Reg_ALG′(T)).
As ALG′ is an algorithm for the FPMAB problem, its regret is lower bounded as in Theorem 3, for a known constant C > 0. We complete the proof of Theorem 2 by optimising our choice of K as a function of T. Substituting ε = 1/(2K) and choosing K = O(T^{1/3}) yields the stated result.

Proof of Theorem 3
Proof. Given a set of filtering parameters γ_1, . . . , γ_K, we construct a problem instance where there is a single "good" arm, i ∈ [K], with mean reward µ_i = 1 + ε, for small ε ∈ (0, 1/2], and all other arms, k ≠ i, have mean rewards µ_k = 1. This is achieved by choosing the CIF parameters Λ_1^{(i)}, . . . , Λ_K^{(i)} accordingly. Here the superscript ·^{(i)} denotes that i is the good arm under this choice of parameters, and we notice that condition (9) on the filtering parameters is required for Λ_1^{(i)}, . . . , Λ_K^{(i)} to constitute a valid (i.e. increasing) sequence of CIF parameters.
We define three notions of probability and expectation relevant to the analysis of problem instances of this type. Let P*(·) denote probability with respect to the above construction of the FPMAB where the good arm is chosen uniformly at random from [K]. Let P_i(·) be defined similarly, but denote probability conditioned on the event that i ∈ [K] is the good arm. Finally, let P_equ denote probability with respect to a version where µ_k = 1 for all k ∈ [K]. We let E*(·), E_i(·), and E_equ(·) denote the associated expectation operators.
Let A be the decision-maker's algorithm, let r_t = (R(a_1), . . . , R(a_t)) denote the sequence of observed rewards in t rounds, and let r̃_t = ((R̃_{1,1}, . . . , R̃_{a_1,1}), . . . , (R̃_{1,t}, . . . , R̃_{a_t,t})) denote the sequence of filtered observations in t rounds. Any algorithm A may then be thought of as a deterministic function from {r_{t−1}, r̃_{t−1}} to a_t for all t ∈ [T]. Even an algorithm with randomised action selection can be viewed as deterministic, by treating a given run as a single member of the population of all possible instances of that algorithm. Further, we define G_A = Σ_{t=1}^T R_t to be the reward accumulated by A in T rounds and G_max = max_{j∈[K]} Σ_{t=1}^T R_t(j) to be the reward accumulated by playing the best action. The regret of A in T rounds may be expressed as Reg(T) = E(G_max) − E(G_A). Let N_k be the number of times an arm k ∈ [K] is chosen by A in T rounds. The first step of the proof is to bound the difference in the expectation of N_i when measured using E_i and E_equ, i.e. to bound the difference in the number of times an algorithm will play i between when i is the good arm and when all arms are equally valuable.

Such a bound, which exploits the construction of the CIF parameters Λ^{(i)}, is given by Lemma 4 below.
The expectation in the regret measure is taken with respect to P*, rather than any P_i, and as such E*(G_A) is the quantity of interest. We recall that under P* the "good" arm is chosen uniformly at random, and thus E*(G_A) may be expressed as an average over the P_i; the second inequality in the resulting chain of bounds uses Lemma 4. Considering the final term of (17), we apply the Cauchy-Schwarz inequality, and the regret is bounded as required.

Proof of Lemma 4
We first introduce some further notation used in the proof. For any distributions P and Q over vector sequences r̃ ∈ N^{K×T}, define the variational distance as

‖P − Q‖₁ = Σ_{r̃∈N^{K×T}} |P(r̃) − Q(r̃)|,

and the KL divergence as

KL(P ‖ Q) = Σ_{r̃∈N^{K×T}} P(r̃) log( P(r̃) / Q(r̃) ).
By Pinsker's inequality, we have the following relationship between these distances:

‖P − Q‖₁ ≤ √(2 KL(P ‖ Q)).    (18)

Finally, the KL divergence between two Poisson distributions with parameters λ and ν is given as

KL(Pois(λ) ‖ Pois(ν)) = λ log(λ/ν) + ν − λ.

Proof. For any function f : N^{K×T} → [0, M], with M > 0 constant, we have

E_i(f(r̃)) − E_equ(f(r̃)) ≤ (M/2) ‖P_i − P_equ‖₁ ≤ M √(KL(P_equ ‖ P_i)/2),

where the final inequality follows from (18). Considering the KL divergence term in isolation, we have, by the chain rule for KL divergence (Theorem 2.5.3 of Cover and Thomas (2012)),

KL(P_equ ‖ P_i) = Σ_{t=1}^T KL( P_equ(r̃_t | r̃_{1:t−1}) ‖ P_i(r̃_t | r̃_{1:t−1}) ).

Here the parameters Λ_k^{equ}, k ∈ [K], refer to the choice of CIF parameters which yields µ_k = 1 for all k ∈ [K]. The final equality follows from the observation that if a_t < i then the distribution of the filtered observations is identical under P_equ and P_i. Decomposing the sum over k, and observing that for j > i + 1 the CIF parameters under the "single good arm" and "all arms equal" constructions also match, so that the corresponding Poisson KL divergence terms are zero for any j > i + 1, we obtain the bound (20) for ε ≤ (γ_i − γ_{i+1})/γ_{i+1}. The inequality uses the identity for the KL divergence between Poisson distributions given above. It remains to bound the summation in (20) with an o(ε²) term. For general a ∈ [0, 1], b ∈ [0, 1], and 0 ≤ x ≤ b, consider the function g; for some C > 1 we have a linear bound on its derivative. Solutions to g(x) = Cx² are not available in closed form, but since g(0) = 0 and dg/dx|_{x=0} = 0, we have as a minimum that g(x) ≤ Cx² for x as in (21). Choosing C = (ab + a − b)/(a + b)² gives g(x) ≤ Cx² for x ∈ [0, (a + b)/2]. It therefore follows that (22) holds for all x ∈ [0, γ_i/(2γ_{i+1}) − γ_i/(2γ_{i−1})]. Combining (20) and (22), we therefore have that the KL divergence from P_equ to P_i may be bounded as required.
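The closed form for the Poisson KL divergence used above can be checked against a direct truncated evaluation of the definition; the parameter values are arbitrary:

```python
import math

# Check KL(Pois(lam) || Pois(nu)) = lam*log(lam/nu) + nu - lam against a
# direct truncated evaluation of sum_k P(k) log(P(k)/Q(k)). The parameter
# values are arbitrary illustrations.

def pois_pmf(k, mu):
    # pmf via logs to avoid overflow in k! for larger k
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def kl_poisson_closed(lam, nu):
    return lam * math.log(lam / nu) + nu - lam

def kl_poisson_numeric(lam, nu, kmax=200):
    total = 0.0
    for k in range(kmax + 1):
        p, q = pois_pmf(k, lam), pois_pmf(k, nu)
        if p > 0.0 and q > 0.0:
            total += p * math.log(p / q)
    return total

lam, nu = 2.0, 3.0
closed = kl_poisson_closed(lam, nu)
numeric = kl_poisson_numeric(lam, nu)
```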

Experiments
In this section we illustrate the performance of CIF-UCB via numerical examples. We work with a linear intensity function λ(x) = 20 − 20x and exponential filtering probability γ(x) = exp(−x), both for x ∈ [0, 1]. The plot of Λ(x)γ(x) is shown in Figure 2, with x* = 0.586 and Λ(x*)γ(x*) = 4.61 (found numerically). In the experiment, we set the Lipschitz constant m = 20, which equals max_{0≤x≤1} d(Λ(x)γ(x))/dx (attained at x = 0, since Λ(x)γ(x) is concave), and λ_max = 20. We ran 100 independent sample paths over a time horizon of T = 50000, and computed the average cumulative regret over the 100 sample paths. The resulting average cumulative regret is shown in Figure 3, along with the upper regret bound, as determined in Theorem 1. Several observations are in order. First, the dotted curve in Figure 3 does not include the constant terms (equal to 360 in this case), nor the sub-log(t)t^{2/3} terms that arise in the regret upper bound derivation (cf. Eq. (7)). Still, we note that the regret growth is plausibly of order O(t^{2/3}).
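The numerically found optimum quoted above can be reproduced with a simple grid search over [0, 1] (and agrees with the root 2 − √2 of d/dx[Λ(x)γ(x)] = 0):

```python
import numpy as np

# Reproduce the numerically found optimum of Lambda(x)*gamma(x) for
# lambda(x) = 20 - 20x (so Lambda(x) = 20x - 10x^2) and gamma(x) = exp(-x).
x = np.linspace(0.0, 1.0, 100001)
objective = (20 * x - 10 * x ** 2) * np.exp(-x)
i = int(np.argmax(objective))
x_star, val = float(x[i]), float(objective[i])
print(round(x_star, 3), round(val, 2))  # 0.586 4.61
```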
The second observation concerns the shape of the average cumulative regret. Note that the cumulative regret appears to be piece-wise convex increasing, with each successive convex piece growing at a slower rate; this observation is even more noticeable on individual sample paths (not shown). This growth pattern is due to the splitting condition of CIF-UCB, whereby the algorithm initially samples the better of the two segments that result from a split, and explores the other (typically worse) segments as $t$ gets larger. As $t$ grows, the algorithm exploits more often, and thus each convex piece grows more slowly.
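Under the reward structure used in this experiment (expected reward $\Lambda(x)\gamma(x)$ for action $x$), a single round of the environment can be simulated by Lewis–Shedler thinning of the Poisson process followed by independent filtering. The sketch below assumes this observation model and is not the paper's implementation; the mean revealed count at $x^*$ should be close to $\Lambda(x^*)\gamma(x^*) = 4.61$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = lambda u: 20 - 20 * u     # intensity on [0, 1]
lam_max = 20.0                  # envelope for thinning

def filtered_sample(x, gamma_x, rng):
    """One round: simulate a NHPP on [0, x] by thinning a homogeneous
    Poisson process of rate lam_max, then reveal each point
    independently with probability gamma_x."""
    n = rng.poisson(lam_max * x)                        # candidate points
    pts = rng.uniform(0.0, x, size=n)
    keep = rng.uniform(size=n) < lam(pts) / lam_max     # thinning step
    revealed = rng.uniform(size=keep.sum()) < gamma_x   # filtering step
    return revealed.sum()

x_star = 0.586
counts = [filtered_sample(x_star, np.exp(-x_star), rng)
          for _ in range(20000)]
print(np.mean(counts))   # close to Lambda(x*) * gamma(x*) = 4.61
```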
The final observation concerns the splitting pattern. We include in Table 1 the data frame for the final round of a sample path in the R implementation, which includes the two endpoints ($x$ and $y$) of each final segment, the effective number of samples of each final segment $\sum_{i=1}^{|V_T(x,y)|} \gamma(b_{\tau_i})$, the index $I_T(x, y)$, and the CIF estimator $\hat{\Lambda}_T(y)$ in the rightmost column. The finer spatial grid around $x^*$ is appreciable, suggesting that the algorithm gravitates towards the segment that contains the optimal solution $x^*$. Note also that the estimates of $\Lambda(x) = 20x - 10x^2$ are very precise (the largest relative error is 0.62%, for $y$ large, since the segments close to 1 have the fewest effective samples $\sum_{i=1}^{|V_T(x,y)|} \gamma(b_{\tau_i})$). The index values are similar across the final segments, as is typical of UCB algorithms, and the effective number of samples drops off significantly to the right of $x^*$. On the other hand, the effective number of samples to the left of $x^*$ is large, since the algorithm needs to cover that space to reach (and exploit) the neighbourhood around $x^*$.
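The exact index $I_T(x, y)$ is defined earlier in the paper; purely as an illustration of how a UCB-style index built on an effective sample count might be computed, consider the following sketch (the confidence-radius form and all names here are our assumptions, not the paper's definition):

```python
import math

def ucb_index(reward_sum, eff_samples, t, lam_max=20.0):
    """Hypothetical UCB-style index for a segment: empirical mean reward
    plus a confidence radius shrinking in the effective sample count
    sum_i gamma(b_{tau_i}). Illustrative only; not the CIF-UCB index."""
    if eff_samples == 0:
        return float("inf")   # unsampled segments are explored first
    mean = reward_sum / eff_samples
    radius = lam_max * math.sqrt(math.log(t) / eff_samples)
    return mean + radius

# Example: a segment with mean reward 4.61 after 10 effective samples.
print(ucb_index(46.1, 10.0, 1000))
```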
Table 1: Summary of main parameters after a sample path.

To test the sensitivity of the algorithm to multiple local maxima, we ran a second experiment with parameters identical to those of the first experiment, except for the filtering probability $\gamma(\cdot)$, which is now set to be piece-wise linear and decreasing. This filtering probability leads to an objective $\Lambda(x)\gamma(x)$ as in Figure 4, with $x^* = 0.8$ and $\Lambda(x^*)\gamma(x^*) = 4.8$. We tested CIF-UCB over 100 independent sample paths, with a time horizon $T = 50000$. The resulting average cumulative regret is shown in Figure 5.
Two main observations can be drawn. First, the $\tilde{O}(T^{2/3})$ upper bound of Theorem 1 holds over $t \in \{1, \ldots, T\}$. Second, the average cumulative regret at $t = T$ is about 10% larger than in the first experiment. This can be ascribed to the larger optimal value of the objective function (4.8 versus 4.61 in the first experiment), and to the extra exploration induced by the local maximum at $x = 0.33$.

Discussion
This work considers a sequential variant of the problem faced by a decision-maker who attempts to maximise the detection of events generated by a filtered non-homogeneous Poisson process, where the filtering probability depends on the segment selected by the decision-maker and the Poisson cumulative intensity function is unknown. The independent-increments property of the Poisson process makes the analysis tractable, enabling the use of machinery developed for the continuum-armed bandit problem. The problem of efficient exploration/exploitation of a filtered Poisson process on a continuum arises naturally in settings where observations are made by searchers (representing cameras, sensors, robotic and human searchers, etc.), and the events that generate observations tend to disappear (or renege, in queueing terminology) before an observation can be made, as the interval of search increases. Besides extending the state-of-the-art to such settings, the main contributions are an algorithm for bandit learning on a filtered Poisson process on a continuum, and regret bounds that are optimal up to a logarithmic factor.