Learning vs earning trade-off with missing or censored observations: The two-armed Bayesian nonparametric beta-Stacy bandit problem

Existing Bayesian nonparametric methodologies for bandit problems focus on exact observations, leaving a gap in those bandit applications where censored observations are crucial. We address this gap by extending a Bayesian nonparametric two-armed bandit problem to right-censored data, where each arm is generated from a beta-Stacy process as defined by Walker and Muliere (1997). We first show some properties of the expected advantage of choosing one arm over the other, namely the monotonicity in the arm response and, limited to the case of continuous state space, the continuity in the right-censored arm response. We partially characterize optimal strategies by proving the existence of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser break-even points, under non-restrictive conditions that include the special cases of the simple homogeneous process and the Dirichlet process. Numerical estimations and simulations for a variety of discrete and continuous state space settings are presented to illustrate the performance and flexibility of our framework. MSC 2010 subject classifications: Primary 62C10; secondary 62N01. Received December 2016.

the basis of the previous observations, and it balances two conflicting benefits: the immediate payoff coming from the exploitation of a better known arm and the information concerning future payoffs coming from the exploration of a less known arm. A strategy is said to be optimal if it yields the maximal expected payoff, and an arm is said to be optimal if it is selected at the beginning of an optimal strategy.
More formally, let X_k, Y_k ∈ [0, ∞) =: R_+ be random variables (responses) generated from, respectively, arms 1 and 2 at stage k, for k = 1, 2, . . . , n, where n ∈ N \ {0}, N := {0, 1, 2, 3, . . . }, is the (possibly infinite) bandit horizon. If X_k and Y_k are the responses of the k-th patient to treatments 1 and 2, the k-th stage is interpreted as the moment at which the treatment to assign to the k-th patient has to be decided, given the past responses of the previous k − 1 patients to the treatments. Generally speaking, the k-th stage of the bandit problem is the phase in which one of the two arms is chosen to be observed, on the basis of the responses of the past k − 1 subjects. We assume that X_1, X_2, . . . , X_n given the probability law F_1 are i.i.d. with law F_1, and that Y_1, Y_2, . . . , Y_n given the probability law F_2 are i.i.d. with law F_2, with F_1 and F_2 independent. That is, we assume exchangeable responses within treatments and independent responses between treatments.
A strategy is interpreted, following Berry and Fristedt (1985), as a function that assigns, to each partial history of observations, the integer 1 or 2 indicating the arm to be observed at the next stage or, equivalently, the arm to which the next subject is assigned. With the exception of the simplest cases, explicit specifications of optimal strategies are hindered by computational issues. As a consequence, as in Chattopadhyay (1994), optimal strategies can only be partially characterized in terms of break-even observations. We will consider two kinds of strategies: stay-with-a-winner and stay-with-a-winner/switch-on-a-loser strategies, assuming, without loss of generality, that a higher realized value of a random variable gives a higher payoff. Intuitively, if at the current stage arm 1 is optimal, according to the stay-with-a-winner strategy, arm 1 is optimally chosen to be observed at the next stage if the observation from arm 1 at the current stage is higher than a break-even point. The stay-with-a-winner break-even point is defined such that the expected advantage of arm 1 over arm 2 is at least as high as it was before the observation at the current stage. On the other hand, in a stay-with-a-winner/switch-on-a-loser strategy, if arm 1 is currently observed, optimally or not, arm 1 will be optimally chosen at the next stage if the current observation is higher than a break-even point (different from the break-even point of the previous strategy); otherwise arm 2 is optimally chosen. The stay-with-a-winner/switch-on-a-loser break-even point is defined such that the expected advantage remains positive after the observation from the arm at the current stage. We will define the two strategies more formally in Section 3.3.

Related literature
Early examples of bandit problems are treated in Robbins (1952), Bellman (1956) and Bradt, Johnson and Karlin (1956). Among later works, Chernoff (1968) focuses on two Gaussian arms F_i = N(μ_i, σ^2), i = 1, 2, with unknown drifts μ_1 and μ_2 of the first and second arm respectively, and known constant and common variance σ^2; Berry (1972) gives sufficient conditions for optimal selection and the existence of a stay-with-a-winner strategy in a Bernoulli two-armed bandit, F_i = Bern(p_i), i = 1, 2, where p_i is the unknown probability of observing a realized value of 1 from arm i; Berry and Fristedt (1979) characterize optimal strategies for Bernoulli one-armed bandits (F_1 = Bern(p) and F_2 known) with regular discount sequences; Gittins (1979) introduces dynamic allocation indices for optimal strategies in multi-armed bandits. Clayton and Berry (1985) is the first paper that extends the bandit problem to a Bayesian nonparametric framework, considering a random F_1 ∼ DP(α), the Dirichlet process introduced in Ferguson (1973), with bounded nonnull measure α on R, and known F_2: the probability measure associated with the random variables in one of the two arms is random and drawn from the Dirichlet process. Dirichlet bandits are generalized to two-armed problems F_i ∼ DP(α_i), α_i a probability measure on R, i = 1, 2, in Chattopadhyay (1994), where the existence of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser optimal strategies is proven. Some other properties of Dirichlet bandits are studied in Yu (2011).

Our contribution
In this paper we extend Bayesian nonparametric bandits to problems where each arm generates an infinite sequence of exchangeable random variables (de Finetti 1937) having, as de Finetti measure, the beta-Stacy process (BS) of Walker and Muliere (1997). In our framework the two arms are random, with F_i ∼ BS(α_i, β_i), i = 1, 2, where α_i and β_i, extensively discussed in Section 2, characterize the two beta-Stacy processes. As specified in Phadia (2013), the beta-Stacy process generalizes the Dirichlet process in two respects: more flexible prior information may be represented and, unlike the Dirichlet process, it is conjugate to right-censored data. Also, when the prior process is assumed to be Dirichlet, the posterior distribution given right-censored observations is a beta-Stacy process. We will discuss the properties of the beta-Stacy process in more detail in Section 2.
The Dirichlet bandit of Clayton and Berry (1985) and Chattopadhyay (1994) is therefore an important special case of our setting, as is the bandit problem with the simple homogeneous process of Susarla and Van Ryzin (1976) and Ferguson and Phadia (1979). Our main result is that, under constraints on the parameters of the beta-Stacy processes (constraints that include the cases of the simple homogeneous process and the Dirichlet process), stay-with-a-winner and stay-with-a-winner/switch-on-a-loser break-even points characterizing optimal strategies exist and can be used for dealing with right-censored or exact observations. A right-censored observation is a realized value that is capped by a known censoring level: the observed response from arm 1 at stage k is x_k = min{x*_k, c^x_k}, the minimum between a true exact unobserved x*_k and a known censoring level c^x_k, and equivalently for arm 2. We stress that we know whether each observation has been censored or not, so that in the sequel we will simply denote by X_1, . . . , X_n the random variables from arm 1, and by Y_1, . . . , Y_n those from arm 2, with realized values x_1, . . . , x_n and y_1, . . . , y_n, each known to be right-censored or not. For every k, c^x_k and c^y_k become known at stage k + 1, after the response of the k-th subject is observed. We assume that subjects' responses are immediate, in the sense that the (realized) response of subject k is observed together with the information on whether it is censored or not and with the value of c^x_k (or c^y_k); in our setting, however, a censored observation will never become exact. A typical example of a censored observation in our setting would be the survival time returned at the current stage by a patient who dies of causes not related to the treatment or who abandons the study, with no possibility of ever obtaining the patient's exact response.
Arm responses with missing values can be seen as a special case of arms with right-censored observations: since X_k and Y_k ∈ R_+, a missing observation can be treated as a censored observation with censoring level equal to zero. Consistently with Hardwick, Oehmke and Stout (1998), we then consider a missing observation as an observation subject to the hardest case of censoring: the one giving no information whatsoever on the true response value. On the other hand, in commonly right-censored observations (as in the motivating examples in the Introduction) the censoring level obviously provides information on the minimum value of the true, exact but unobserved, response.
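As a minimal sketch of this convention (the `Obs` container and helper names below are our own illustration, not notation from the paper), each response can be stored together with its censoring status, and a missing response recorded as right-censored at level zero:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obs:
    value: float    # realized response: min(true value, censoring level)
    censored: bool  # True if value is only a lower bound (right-censored)

def exact(x: float) -> Obs:
    # a fully observed response
    return Obs(value=x, censored=False)

def missing() -> Obs:
    # a missing response carries no information on the true value:
    # encode it as right-censored at 0, the hardest censoring level
    return Obs(value=0.0, censored=True)
```

A commonly right-censored response would instead be stored as `Obs(value=c, censored=True)` with c > 0, so the censoring level still carries a lower bound on the true response.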

Some examples motivating bandits with censorship
Beta-Stacy bandit problems are motivated by the importance of dealing with censored observations in typical bandit applications: the two arms can be two treatments available for a certain disease (Berry and Fristedt 1985); patients arrive one at a time and a treatment is assigned. The patient returns information on the effectiveness of the treatment: this response, the patient's survival time after the treatment, can be censored if she interrupts the treatment, if she dies of unrelated causes, or if the observational period ends before her death. The responses are therefore patients' survival times after the treatment (which may be censored) and the objective is to maximize the total discounted expected survival times. Another classical example of a bandit application with censored observations may arise when a manager of several teams of chemical scientists has to decide on the allocation of resources among the teams, with the aim of minimizing the expected time up to the creation of new successful products (Nash 1973): the two arms are the two teams of scientists, and a fixed budget for the creation of the new product is assigned only to one team. The arm response is the time invested by the team to create the successful product, a response that can be right-censored if the project is interrupted due to reduced financial support. A final example of a bandit problem with censored data is batch job scheduling on an industrial processor, choosing which jobs to process at each stage, with the aim of minimizing the whole expected processing time (Gittins, Glazebrook and Weber 2011 and references therein): the arms correspond to the different jobs that the machine has to execute, and the arm response is the time needed to execute a specific task composing the job. The response may be censored if the task has failed and is unfinished because of system breakdowns, or if it lasts more than the maximum amount of time allocated to the job.
We then know that the task lasts at least up to the breakdown or up to the maximum time, but we do not have the precise task duration.

Outline of the paper
In Section 2 we introduce and define the beta-Stacy process (Section 2.1), relating it to other known stochastic processes (2.2) and recalling its posterior properties (2.3). The beta-Stacy bandit problem with two discrete-stage arms is detailed in Section 3: we first describe the mechanism of the problem in Section 3.1, then we introduce some further notation in Section 3.2, with particular emphasis on the expected bandit payoff and the expected advantage of choosing arm 1 over arm 2; finally, in Section 3.3 we characterize the stay-with-a-winner and stay-with-a-winner/switch-on-a-loser strategies we study. In Sections 4 and 5 we show, respectively for discrete and continuous beta-Stacy arms, some properties of the expected advantage, namely its monotonicity (Sections 4.2 and 5.2) and continuity (5.3) in the arm response, and we then show the existence of break-even points of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser strategies (4.3 and 5.4). We apply our methods to simulated problem instances in Section 6, and conclude with examples of potential further applications and research directions in Section 7.

Introduction and definition
Under the assumption of exchangeability of the sequence of random variables X_1, . . . , X_k, . . . , with k ∈ N\{0} and each X_i ∈ R_+, from de Finetti's representation theorem (de Finetti 1937) there exist a random probability measure P and a corresponding random cumulative distribution function (cdf) F, conditionally on which X_1, . . . , X_k, . . . are i.i.d. from F. That is, there exists a unique probability (or de Finetti) measure Q, defined on the space of probability measures on (R_+, A), A the Borel σ-field of subsets of R_+, such that the joint distribution of X_1, . . . , X_n, for any n ∈ N and events A_1, . . . , A_n in A, can be written as

P(X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∫ ∏_{i=1}^n P(A_i) Q(dP).

In our framework F is fixed to be the beta-Stacy process defined below. In the rest of the paper we denote with E the expected value with respect to the probability measure P. The expected value of F, E[F(t)] for all t ∈ R_+, is called the base measure of F. Furthermore, the assumption of exchangeability implies, for any event A ∈ A, that

P(X_{n+1} ∈ A | X_1, . . . , X_n) = E[P(A) | X_1, . . . , X_n],

with the special case, for any t ∈ R_+,

P(X_1 ≤ t) = E[F(t)]. (1)

Let the right continuous measure α and the positive function β both be defined on R_+, with α(0) = 0. For t ∈ R_+, we write α(t) for the value of the measure α over the region [0, t]. Let {t_1, t_2, . . .} be the countable set of discontinuity points of α, corresponding to jumps α{t_k} > 0, and let α_c(t) = α(t) − ∑_{t_k ≤ t} α{t_k}, so that α_c is a continuous measure.
Definition 2.1. F is a beta-Stacy process on (R_+, A) with parameters α(t) and β(t) if F(t) = 1 − exp(−Z(t)), where Z is a Lévy process with Lévy measure for Z(t) given, for v > 0, by

dN_t(v) = dv/(1 − e^{−v}) ∫_0^t e^{−v β(s)} dα_c(s),

and with log moment generating function given by

log E[exp(−φ Z(t))] = −∫_0^∞ (1 − e^{−φv}) dN_t(v) + ∑_{t_k ≤ t} log E[exp(−φ S_{t_k})],

where 1 − exp(−S_{t_k}) ∼ Beta(α{t_k}, β(t_k)), and t_k, for some k ∈ N, are the discontinuity points of α. If α is purely atomic on N (it is strictly positive only at t_k, for some k ∈ N), we say that the beta-Stacy process is discrete; otherwise the beta-Stacy process is said to be continuous.
In the previous definition we can interpret dN_t(v) as the rate of arrival (intensity) of a Poisson process with jumps of size v, whilst φ denotes the argument of the moment generating function E[exp(−φZ(t))]. Note also that dα(t) and β(t), t ∈ R_+, can intuitively be thought of as the measure that the beta-Stacy process assigns a priori to the infinitesimal interval around t, and to the interval (t, ∞), respectively. Finally, note that it is not relevant to include the point 0 in the domain of the beta-Stacy parameters, since from the assumption α(0) = 0 the point zero always has null mass. For X|F ∼ F, F ∼ BS(α, β) discrete, from (1) we can write, for all t ∈ R_+,

P(X ≤ t) = E[F(t)] = 1 − ∏_{t_k ≤ t} β(t_k)/(α{t_k} + β(t_k)),

whilst if F ∼ BS(α, β) continuous, with α having discontinuity points {t_k}:

P(X ≤ t) = E[F(t)] = 1 − exp(−∫_0^t dα_c(s)/β(s)) ∏_{t_k ≤ t} β(t_k)/(α{t_k} + β(t_k)).
In order for F to be a cdf almost surely (a.s.), the parameters α and β of a discrete beta-Stacy process are required to satisfy the condition

∏_k β(t_k)/(α{t_k} + β(t_k)) = 0. (2)

When α has no discontinuity points, the analogue of condition (2) is the requirement that α and β satisfy

∫_0^∞ dα_c(s)/β(s) = +∞. (3)

When α has both continuous and discrete parts, condition (2) has to hold for all t which are discontinuity points of α, and condition (3) has to hold for α_c, the continuous part of α defined above.
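For a concrete feel of these quantities, here is a small sketch (our own illustration, with arbitrary parameter values) of the base measure of a discrete beta-Stacy process with atoms on {0, 1, 2}; the survival-product form E[F(t)] = 1 − ∏_{s ≤ t} β(s)/(α{s} + β(s)) follows from the Beta-distributed jumps of Definition 2.1, and β(2) = 0 makes the product vanish, so all prior mass is assigned by t = 2:

```python
def base_cdf(alpha, beta, t):
    # E[F(t)] = 1 - prod_{s <= t} beta(s) / (alpha{s} + beta(s))
    # alpha: atom masses alpha{0}, alpha{1}, ...; beta: values beta(0), beta(1), ...
    surv = 1.0
    for s in range(t + 1):
        surv *= beta[s] / (alpha[s] + beta[s])
    return 1.0 - surv

# arbitrary illustrative parameters on the support {0, 1, 2}
alpha = [1.0, 2.0, 1.0]
beta = [3.0, 1.0, 0.0]  # beta(2) = 0: the product vanishes, so F is a cdf a.s.
print([base_cdf(alpha, beta, t) for t in range(3)])  # -> [0.25, 0.75, 1.0]
```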

A first example and relation to other processes
In the current subsection we illustrate an example of a beta-Stacy process with one prior discontinuity point, and we clarify under which conditions on the parameters the process reduces to the Dirichlet process and to the simple homogeneous process. As a first instance of a (continuous) beta-Stacy process, we fix, for all t ∈ R_+ and some λ, l ∈ R_+, where, for some event A, 1_A denotes the indicator function of A, equal to 1 if its argument belongs to A and 0 otherwise, and l is the prior discontinuity point.
Then it is clear that the Lévy measure is given, for all t, v ∈ R_+, by and the log moment generating function, for all φ ∈ R, can be shown to be equal to where B(·, ·) is the usual beta function. Therefore, for all t ∈ R_+, When β(t) = α((t, ∞)) for all t ∈ R_+ or t ∈ N, and α any measure on R_+, we obtain the Dirichlet process prior of Ferguson (1973), as in this case α(t) = 1 − e^{−λt} and β(t) = e^{−λt}, with Lévy measure as in (4), log moment generating function equal to the first term on the right hand side of (5), and a base measure with exponential density of parameter λ. Note that if β(t) = α((t, ∞)) (then a priori a Dirichlet process) and right-censored observations are then collected, a posteriori the process is not a Dirichlet process but a more general beta-Stacy process, since the relation between α and β, both updated after the right-censored observations, changes. Another important special case is the simple homogeneous process of Susarla and Van Ryzin (1976) and Ferguson and Phadia (1979), arising when β(t) = β ∈ R_+ is constant for all t ∈ R_+. As an example of a simple homogeneous process, if we fix α(t) = 1 − e^{−λt} and β(t) = β, it can be shown that the Lévy measure and the corresponding log moment generating function of the process are, for φ ∈ R, the following: The base measure of the simple homogeneous process is In the rest of the paper, we only consider beta-Stacy processes in the general formulation of Definition 2.1, but it is reasonable to conjecture that the results can be generalized to the class of neutral to the right (NTR) processes (Doksum 1974). The NTR process may be viewed in terms of a process with independent non-negative increments, via the parameterization F(t) = 1 − e^{−Z(t)}, t ∈ R_+, where Z is a process with independent nonnegative increments. The beta-Stacy process is an NTR process where Z is a so-called log-beta process (Walker and Muliere 1997), which keeps the conjugacy property under sampling of exact or right-censored observations.
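As a quick numerical sanity check of the Dirichlet reduction (our own illustration, with an arbitrary purely atomic α on {0, 1, 2}), setting β(t) = α((t, ∞)) makes the base-measure survival function equal to the normalized tail of α:

```python
alpha = [0.2, 0.5, 0.3]                        # atomic measure on {0, 1, 2}, total mass 1
beta = [sum(alpha[s + 1:]) for s in range(3)]  # beta(t) = alpha((t, inf)): Dirichlet case

tails, surv = [], 1.0
for t in range(3):
    surv *= beta[t] / (alpha[t] + beta[t])     # P(X > t) under the base measure
    tails.append(surv)

# under the Dirichlet reduction, P(X > t) = alpha((t, inf)) / alpha(R+)
print(tails)  # approximately [0.8, 0.3, 0.0]
```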

Posterior properties
We now state the theorem of Walker and Muliere (1997) on the conjugacy of the beta-Stacy process.
Theorem 2.2 (Walker and Muliere 1997). Assume we observe X_k = x_k, for k = 1, . . . , n, n ∈ N denoting the sample size, and such that X_k|F ∼ F, where F ∼ BS(α, β) is a discrete (continuous) beta-Stacy process. We partition x = (x_1, . . . , x_n) as [x^{exact}, x^{cens}] for respectively exact and right-censored observations. Then the posterior F |(X_1 = x_1, . . . , X_n = x_n) is also a discrete (continuous) beta-Stacy process BS(α^x, β^x), where, for all t,

α^x{t} = α{t} + N^x{t},   β^x(t) = β(t) + M^x(t),

with N^x{t} the number of exact observations equal to t, and M^x(t) the sum of the number of exact observations greater than t and censored observations greater than or equal to t.

The theorem above clarifies an important property of the (continuous or discrete) beta-Stacy process: its conjugacy under sampling, possibly with right censoring. A posteriori (after the observation of x) the corresponding jumps S_t, for all t ∈ x, are such that

1 − exp(−S_t) ∼ Beta(α^x{t}, β^x(t)).

From the conjugacy property and equation (1), for X_1, . . . , X_n exchangeable from F ∼ BS(α, β) continuous and α with discontinuity points {t_k} each in R_+ or in N, we can write, for any n ∈ N and t ∈ R_+,

P(X_{n+1} ≤ t | X_1 = x_1, . . . , X_n = x_n) = 1 − exp(−∫_0^t dα_c(s)/β^x(s)) ∏_{s ≤ t: α^x{s} > 0} β^x(s)/(α^x{s} + β^x(s)),

which specializes, for α with no discontinuity points, to

P(X_{n+1} ≤ t | X_1 = x_1, . . . , X_n = x_n) = 1 − exp(−∫_0^t dα_c(s)/β^x(s)) ∏_{s ∈ x^{exact}: s ≤ t} β^x(s)/(N^x{s} + β^x(s)),

and, for the discrete beta-Stacy process and for all t ∈ N, to

P(X_{n+1} ≤ t | X_1 = x_1, . . . , X_n = x_n) = 1 − ∏_{s ∈ N: s ≤ t} β^x(s)/(α^x{s} + β^x(s)).

Note that the update of the discrete or continuous beta-Stacy parameters keeps track not only of the number of observations (censored or not), but also of their values.
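The update of Theorem 2.2 is easy to implement for a discrete arm; the sketch below (our own minimal illustration, with illustrative parameter values) adds a unit jump to α at each exact observation and adds to β(t) the count M(t) of exact observations greater than t plus censored observations greater than or equal to t:

```python
def posterior(alpha, beta, exact_obs, censored_obs):
    # alpha, beta: lists over the support {0, 1, ..., T}
    # exact_obs, censored_obs: observed values in the support
    a = list(alpha)
    for x in exact_obs:
        a[x] += 1.0  # unit jump of alpha at each exact observation
    b = []
    for t in range(len(beta)):
        M = sum(1 for x in exact_obs if x > t) + sum(1 for c in censored_obs if c >= t)
        b.append(beta[t] + M)  # beta gains M(t)
    return a, b

# Dirichlet-type prior on {0, 1}: alpha = (1, 1), beta(t) = alpha((t, inf))
a_post, b_post = posterior([1.0, 1.0], [1.0, 0.0], exact_obs=[1], censored_obs=[])
# posterior predictive P(X > 0) = beta(0) / (alpha{0} + beta(0)) = 2/3,
# matching the Dirichlet posterior measure alpha + delta_1
print(a_post, b_post, b_post[0] / (a_post[0] + b_post[0]))
```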

Beta-Stacy bandit problem formulation
In the proposed framework, ({α_1, β_1}, {α_2, β_2}; A_n) denotes the two-armed beta-Stacy bandit problem, with arm i having a beta-Stacy prior BS(α_i, β_i), for i = 1, 2, and with A_n = (a_1, a_2, . . . , a_n) being a nonincreasing discount sequence. Therefore the special choice of β_i that follows the discussion after Equation (5), together with the absence of censored observations, reduces our setting to the Dirichlet bandit problem ({α_1}, {α_2}; A_n) of Chattopadhyay (1994). The objective in the bandit problem is to maximize the expected payoff, more precisely defined in the next subsection. Assuming without loss of generality that a higher response from the arms corresponds to a higher payoff, we want to choose at each stage which arm to observe, with the aim of maximizing in expectation the sum of all observations from the arms. We only consider bandit problems with discrete stages: at the beginning of stage 1 (in the disease motivating example, before the assignment of the first patient to treatment 1 or 2) it is chosen which arm to observe, only on the basis of α_1, β_1 and α_2, β_2: if the first arm is chosen, we will observe some realized value of X_1, possibly right-censored, where X_1|F_1 ∼ F_1 and F_1 ∼ BS(α_1, β_1); if the second arm is chosen, we will observe some realized value of Y_1, possibly right-censored, where Y_1|F_2 ∼ F_2 and F_2 ∼ BS(α_2, β_2). Therefore at stage 1 the best arm is chosen for the first subject, optimally only on the basis of the prior choices of α_1, β_1 and α_2, β_2, since no previous observation is available yet. Taking into account the additional information coming from the observation at stage 1 of X_1 or Y_1, we will choose the arm to observe at stage 2, in a way that maximizes the expected payoff.
Intuitively, if for instance a high realized value of X_1 is observed, it will be more likely to observe X_2 instead of Y_2, that is, to observe again from the same arm at the next stage, in a trade-off between the exploitation of an arm that is better known to return high observations, and the exploration of the potentially better but less known arm. At stage k > 1, k ∈ N, we decide to observe X_k from arm 1 or Y_k from arm 2 on the basis of the observations [x_{k−1}, y_{k−1}] = [x_{k−1}^{exact}, x_{k−1}^{cens}, y_{k−1}^{exact}, y_{k−1}^{cens}] from the past k − 1 stages, where x_{k−1} are defined to be the observations from arm 1 at stages from 1 to k − 1, partitioned in [x_{k−1}^{exact}, x_{k−1}^{cens}] for exact and right-censored observations, and similarly for y_{k−1} and [y_{k−1}^{exact}, y_{k−1}^{cens}] from arm 2. It is important to highlight that we assume throughout that we know if an observation is right-censored or not. Furthermore, the elements of [x_{k−1}^{exact}, x_{k−1}^{cens}, y_{k−1}^{exact}, y_{k−1}^{cens}] can be empty: for instance, if all the observations from arm 1 are exact up to stage k − 1, x_{k−1}^{cens} = ∅, or if arm 2 has never been observed up to stage k − 1, y_{k−1} = ∅. Conditionally on F_1 the responses from arm 1 are i.i.d., but with no conditioning on F_1 there is a dependence between X_k and the previous observations from arm 1:

P(X_k ≤ t | x_{k−1}) = E[F_1^{x_{k−1}}(t)],

and similarly for Y_k, with F_1^{x_{k−1}} ∼ BS(α_1^{x_{k−1}}, β_1^{x_{k−1}}) and F_2^{y_{k−1}} ∼ BS(α_2^{y_{k−1}}, β_2^{y_{k−1}}), beta-Stacy processes with parameters updated in accordance with Theorem 2.2 above, defined as

α_1^{x_{k−1}}{t} = α_1{t} + N_{k−1}^x{t},   β_1^{x_{k−1}}(t) = β_1(t) + M_{k−1}^x(t), (7)
α_2^{y_{k−1}}{t} = α_2{t} + N_{k−1}^y{t},   β_2^{y_{k−1}}(t) = β_2(t) + M_{k−1}^y(t), (8)

where, coherently with the notation set up in Section 2.3, for all t ∈ R_+ (or in N for the discrete beta-Stacy process), N_{k−1}^x{t} is the number of exact observations in x_{k−1} equal to t and M_{k−1}^x(t) is the sum of the number of exact observations in x_{k−1} greater than t and censored observations in x_{k−1} greater than or equal to t, and similarly for N_{k−1}^y and M_{k−1}^y. For notational convenience, we also define the quantities prior to any information as α_i^∅ := α_i and β_i^∅ := β_i, i = 1, 2.

Expected payoff and advantage
As detailed in the previous section, in a bandit problem ({α_1, β_1}, {α_2, β_2}; A_n), a strategy selects at each stage k = 1, . . . , n which arm to observe, on the basis of past observations from the two arms. Then, a strategy can be characterized by an n-dimensional vector Γ = (γ_1, . . . , γ_n), where γ_k ∈ {1, 2} indicates the arm observed at stage k, for k = 1, . . . , n, with γ_k dependent on past observations from both arms. Without loss of generality, we assume that higher observations are better, so that, for the discount sequence A_n = (a_1, . . . , a_n), we can write the payoff as

∑_{k=1}^n a_k Z_k,   where Z_k = X_k if γ_k = 1 and Z_k = Y_k if γ_k = 2.

An optimal strategy maximizes the expected payoff, that is, the expected discounted sum of arm responses. With the exception of the simplest cases, explicit characterizations of optimal strategies are hindered by computational difficulties, imposing the need for partial characterizations of optimal strategies via break-even observations (Chattopadhyay 1994; Clayton and Berry 1985). In particular, we will prove the existence of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser break-even points. Similarly to Chattopadhyay (1994), we let W({α_1, β_1}, {α_2, β_2}; A_n) be the expected payoff under an optimal strategy, whilst W_i({α_1, β_1}, {α_2, β_2}; A_n) is defined to be the expected payoff of a strategy starting from arm i and proceeding optimally. We define Δ({α_1, β_1}, {α_2, β_2}; A_n) to be the expected advantage of initially choosing arm 1 over arm 2 assuming optimal continuation, that is

Δ({α_1, β_1}, {α_2, β_2}; A_n) = W_1({α_1, β_1}, {α_2, β_2}; A_n) − W_2({α_1, β_1}, {α_2, β_2}; A_n).

Furthermore, we use the notation Δ⁺ = max{Δ, 0} and Δ⁻ = max{−Δ, 0}. All the quantities defined above can be written more generally, substituting the corresponding posterior parameters for the prior beta-Stacy parameters. For instance, Δ({α_1^{x_{k−1}}, β_1^{x_{k−1}}}, {α_2^{y_{k−1}}, β_2^{y_{k−1}}}; A_n^{k−1}) is the expected advantage of choosing arm 1 over arm 2 at stage k ∈ N, after the observation of [x_{k−1}, y_{k−1}] from the arms in the preceding k − 1 stages, and where A_n^{k−1} := (a_k, a_{k+1}, . . . , a_n).
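In code, the realized discounted payoff of a strategy is just the discounted sum of whichever arm's response was collected at each stage (our own sketch; names and values are purely illustrative):

```python
def payoff(gamma, x, y, discounts):
    # gamma[k] in {1, 2}: arm observed at stage k + 1
    # x[k], y[k]: realized responses of the two arms at stage k + 1
    # discounts: nonincreasing sequence a_1, ..., a_n
    z = [xk if g == 1 else yk for g, xk, yk in zip(gamma, x, y)]
    return sum(a * zk for a, zk in zip(discounts, z))

# stay on arm 1 for two stages, then switch to arm 2
print(payoff([1, 1, 2], x=[3.0, 1.0, 0.0], y=[0.0, 0.0, 2.0], discounts=[1.0, 0.9, 0.81]))
```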

Bandit strategies with optimal properties
Let F_i ∼ BS(α_i, β_i), i = 1, 2, F_1 and F_2 independent, and bandit stages k ∈ {1, 2, . . . , n}, n ∈ N. We study two strategies: stay-with-a-winner and stay-with-a-winner/switch-on-a-loser strategies, following the nomenclature of Chattopadhyay (1994). As anticipated in Section 1, in a stay-with-a-winner strategy, the arm currently chosen to be observed is again observed at the next stage if its expected advantage, relative to the alternative arm, is higher than the expected advantage computed before the current observation of the arm. Therefore, an optimal arm chosen at the current stage will again be optimal at the next stage if chosen by the stay-with-a-winner strategy. We now characterize this first strategy: Definition 3.1. The stay-with-a-winner strategy at stage k = 1 selects arm 1 if Δ({α_1, β_1}, {α_2, β_2}; A_n) ≥ 0 and arm 2 otherwise. At stage k > 1, after the observation of [x_{k−1}, y_{k−1}], the strategy chooses to observe arm 1 if

Δ({α_1^{x_{k−1}}, β_1^{x_{k−1}}}, {α_2^{y_{k−1}}, β_2^{y_{k−1}}}; A_n^{k−1}) ≥ Δ({α_1^{x_{k−2}}, β_1^{x_{k−2}}}, {α_2^{y_{k−2}}, β_2^{y_{k−2}}}; A_n^{k−2}),

and selects arm 2 otherwise. The stay-with-a-winner break-even point at stage k − 1, k > 1, is the realized value of X_{k−1} from the first arm (or Y_{k−1} from the second arm) for which the inequality above becomes an equality, making the strategy indifferent in the choice of the two arms at the next stage k.
On the other hand, in a stay-with-a-winner/switch-on-a-loser strategy, the currently observed arm is optimal at the next stage and it will be chosen if the observation is higher than the break-even point at the current stage; if not, the optimal arm to observe at the next stage is the other one.
Definition 3.2. Assume arm 1, optimally or not, is observed at stage k − 1, k > 1. The stay-with-a-winner/switch-on-a-loser strategy at stage k selects arm 1 if

Δ({α_1^{x_{k−1}}, β_1^{x_{k−1}}}, {α_2^{y_{k−1}}, β_2^{y_{k−1}}}; A_n^{k−1}) ≥ 0,

and selects arm 2 otherwise. The stay-with-a-winner/switch-on-a-loser break-even point at stage k − 1, k < n, is the realized value of X_{k−1} from the first arm (or Y_{k−1} from the second arm) for which the inequality above becomes an equality, making the strategy indifferent in the choice of the two arms at the following stage k.

Framework setting
The first (second) bandit arm is observed as long as it yields a value higher (lower) than the break-even point. Let F_i be the random distribution function corresponding to arm i, F_i ∼ BS(α_i, β_i) discrete, with X_k and Y_k having supports in N for all k ≥ 1, for i = 1, 2. Omitting for notational convenience from now on the dependence of P and E on α_1, β_1, α_2, β_2, at stage 1 and for all t ∈ N, we have

P(X_1 > t) = ∏_{s ≤ t} β_1(s)/(α_1{s} + β_1(s)),

and similarly for Y_1. Then the prior means of the two arms are respectively

μ_1 = ∑_{t ∈ N} P(X_1 > t)   and   μ_2 = ∑_{t ∈ N} P(Y_1 > t).

Given the observations [x_{k−1}, y_{k−1}] from arms 1 and 2 up to stage k − 1, the conditional expectation of any function h(X) can be computed using Theorem 2.2, and it is denoted by E[h(X_k)|x_{k−1}]. In the sequel, the updates a posteriori of α_i and β_i, i = 1, 2, follow the notation introduced in formulae (7) and (8): for instance, α_1^{x_{k−1}} is the update of α_1 after having observed x_{k−1}, and it is a fixed measure since x_{k−1} is fully observed; similarly, α_1^{X_1} is the random measure (since X_1 is random) that updates α_1 by taking into account the randomness of the first arm at the first stage. Then the posterior mean of the first arm is

μ_1^{x_{k−1}} = E[X_k | x_{k−1}] = ∑_{t ∈ N} ∏_{s ≤ t} β_1^{x_{k−1}}(s)/(α_1^{x_{k−1}}{s} + β_1^{x_{k−1}}(s)).

A useful result we will often use below, and that we show in the Appendix, is that the expected advantage of arm 1 over arm 2 can be written as

Δ({α_1, β_1}, {α_2, β_2}; A_n) = (a_1 − a_2)(μ_1 − μ_2) + E[Δ⁺({α_1^{X_1}, β_1^{X_1}}, {α_2, β_2}; A_n^1)] − E[Δ⁺({α_2^{Y_1}, β_2^{Y_1}}, {α_1, β_1}; A_n^1)], (10)

where, coherently with the notation introduced, for t ∈ N,

α_1^{X_1}{t} = α_1{t} + 1{X_1 = t, X_1 exact},   β_1^{X_1}(t) = β_1(t) + 1{X_1 > t, X_1 exact} + 1{X_1 ≥ t, X_1 censored},

with a similar construction for α_2^{Y_1} and β_2^{Y_1}. In general, for k ∈ {1, 2, . . . , n}:

Δ({α_1^{x_{k−1}}, β_1^{x_{k−1}}}, {α_2^{y_{k−1}}, β_2^{y_{k−1}}}; A_n^{k−1}) = (a_k − a_{k+1})(μ_1^{x_{k−1}} − μ_2^{y_{k−1}}) + E[Δ⁺({α_1^{[x_{k−1},X_k]}, β_1^{[x_{k−1},X_k]}}, {α_2^{y_{k−1}}, β_2^{y_{k−1}}}; A_n^k)] − E[Δ⁺({α_2^{[y_{k−1},Y_k]}, β_2^{[y_{k−1},Y_k]}}, {α_1^{x_{k−1}}, β_1^{x_{k−1}}}; A_n^k)],

where μ_1^{x_{k−1}} = E[X_k | x_{k−1}], and similarly for all other quantities of interest in (10).
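To make the expected payoffs W_i and the expected advantage Δ concrete, the following dynamic-programming sketch (entirely our own illustration, not the paper's estimation procedure) computes them exactly for two discrete beta-Stacy arms on the tiny support {0, 1}, restricted to exact observations and a uniform discount sequence a_k = 1; all parameter values are arbitrary:

```python
from functools import lru_cache

T = 1  # support {0, ..., T}

def pred(alpha, beta):
    # predictive pmf of the next response: P(X > t) = prod_{s<=t} beta[s]/(alpha[s]+beta[s])
    pmf, prev, surv = [], 1.0, 1.0
    for t in range(T + 1):
        surv *= beta[t] / (alpha[t] + beta[t])
        pmf.append(prev - surv)
        prev = surv
    return pmf

def mean(alpha, beta):
    # E[X] = sum_t P(X > t) for X supported on {0, ..., T}
    surv, mu = 1.0, 0.0
    for t in range(T + 1):
        surv *= beta[t] / (alpha[t] + beta[t])
        mu += surv
    return mu

def update(alpha, beta, x):
    # posterior after one exact observation x (Theorem 2.2):
    # unit jump of alpha at x; beta(t) gains 1 for t < x
    a = list(alpha); a[x] += 1.0
    b = tuple(beta[t] + (1.0 if t < x else 0.0) for t in range(T + 1))
    return tuple(a), b

@lru_cache(maxsize=None)
def W(first, a1, b1, a2, b2, k, n):
    # expected payoff observing arm `first` at stage k, then proceeding optimally
    if k > n:
        return 0.0
    if first == 2:                      # symmetry: swap the arms
        return W(1, a2, b2, a1, b1, k, n)
    total = mean(a1, b1)                # immediate expected payoff from arm 1
    for x, p in enumerate(pred(a1, b1)):
        if p <= 0.0:
            continue
        a1x, b1x = update(a1, b1, x)    # learn from the observation
        total += p * max(W(1, a1x, b1x, a2, b2, k + 1, n),
                         W(2, a1x, b1x, a2, b2, k + 1, n))
    return total

# two identical Dirichlet-type arms on {0, 1}: alpha = (1, 1), beta(t) = alpha((t, inf))
A, B = (1.0, 1.0), (1.0, 0.0)
delta = W(1, A, B, A, B, 1, 2) - W(2, A, B, A, B, 1, 2)
print(delta)  # symmetric arms: the expected advantage is 0
```

With two identical arms the advantage is zero by symmetry; after a high observation from arm 1, the recursion prefers staying on arm 1, in line with the stay-with-a-winner intuition.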

Monotonicity of the expected advantage
In the next proposition we show that, given an exact or a right-censored observation X_1 = x from arm 1, the expected advantage of choosing arm 1 over arm 2 at stage 2 increases as x increases. We remark that we prove this and all subsequent results for the expected advantage at stage 2, after the observation of arm 1 at stage 1, but identical statements can be proved in the same way for the expected advantage at stage k ≥ 1 after the observation of [x_{k−1}, y_{k−1}] up to stage k − 1, at the cost of a slight increase in notational burden: in the next propositions and theorems one substitutes α_1^x and β_1^x with α_1^{[x_{k−1},x]} and β_1^{[x_{k−1},x]}, and α_2 and β_2 with α_2^{y_{k−1}} and β_2^{y_{k−1}}. Furthermore, all the results are stated assuming that the arm to observe at stage 1 is the first one, but they can all similarly be stated in the case when arm 2 is observed at stage 1.
Proposition 4.1. For all α_1, β_1 and α_2, β_2 such that β_1(t) ≤ β_1(t + 1) + α_1{t + 1}, for all t ∈ N, and for all nonincreasing discount sequences A_n, Δ({α_1^x, β_1^x}, {α_2, β_2}; A_n^1) is nondecreasing in x.

Proof. By induction. For n = 1, we have Δ({α_1^x, β_1^x}, {α_2, β_2}; A_1) = a_1(μ_1^x − μ_2). Fix x* = x + 1. We first prove that μ_1^x ≤ μ_1^{x*}. For this purpose, we study separately the t-terms in the sum of μ_1^x and μ_1^{x*} when t ≤ x, t = x* and t > x*. When x is an exact observation,
• the t-terms with t ≤ x are the same in μ_1^x and μ_1^{x*};
• for t = x*, the x*-term of μ_1^{x*} is higher than or equal to the corresponding term in μ_1^x;
• for t > x*, the t-term of μ_1^{x*} is higher than or equal to the corresponding term in μ_1^x.
Similarly, the monotonicity of μ_1^x can be proved when x is a right-censored observation: the t-terms with t ≤ x* are the same in μ_1^x and μ_1^{x*}, whilst for t > x* the term in μ_1^{x*} is higher than or equal to the corresponding term in μ_1^x. Then, for n = 1 the statement is true since μ_1^x is nondecreasing in x and a_1 ≥ 0. From the induction hypothesis, we assume the monotonicity property for n = m − 1 for some natural number m > 1. By (10),

Δ({α_1^x, β_1^x}, {α_2, β_2}; A_m) = (a_1 − a_2)(μ_1^x − μ_2) + E[Δ⁺({α_1^{[x,X]}, β_1^{[x,X]}}, {α_2, β_2}; A_m^1)] − E[Δ⁺({α_2^{Y}, β_2^{Y}}, {α_1^x, β_1^x}; A_m^1)]. (12)

The first term on the right hand side of (12) is nondecreasing in x since μ_1^x is nondecreasing in x and a_1 − a_2 ≥ 0. The second and third terms are nondecreasing in x from the induction hypothesis.
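A quick numerical illustration of Proposition 4.1 in its simplest case (our own sketch; the parameters are arbitrary but of Dirichlet type, so the condition β_1(t) ≤ β_1(t + 1) + α_1{t + 1} holds with equality): the posterior mean μ_1^x after a single observation x is nondecreasing in x.

```python
def post_mean(alpha, beta, x, censored=False):
    # posterior mean of a discrete beta-Stacy arm on {0, ..., T} after one
    # observation x: exact -> unit jump of alpha at x, beta(t) += 1 for t < x;
    # right-censored -> alpha unchanged, beta(t) += 1 for t <= x
    a, b = list(alpha), list(beta)
    if censored:
        for t in range(x + 1):
            b[t] += 1.0
    else:
        a[x] += 1.0
        for t in range(x):
            b[t] += 1.0
    surv, mu = 1.0, 0.0
    for t in range(len(a)):
        surv *= b[t] / (a[t] + b[t])
        mu += surv  # E[X] = sum_t P(X > t)
    return mu

# beta(t) = alpha((t, inf)) on {0, 1, 2}: the condition of Proposition 4.1 holds
alpha, beta = [1.0, 2.0, 1.0], [3.0, 1.0, 0.0]
means = [post_mean(alpha, beta, x) for x in range(3)]
print(means)  # nondecreasing in x
```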

Existence of break-even points
We now prove the existence of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser break-even points in a bandit problem with discrete beta-Stacy processes. The following proposition is preliminary to Theorems 4.5 and 4.6.
Proposition 4.3. For all α_1, β_1 and α_2, β_2 as in Proposition 4.1, under condition (13) on α_1 and β_1, and for all nonincreasing discount sequences A_n, Δ({α_1^x, β_1^x}, {α_2, β_2}; A_n^1) attains its minimum at x = 0, and lim_{x→+∞} Δ({α_1^x, β_1^x}, {α_2, β_2}; A_n^1) = +∞.

Proof. The result for x → 0 is a direct consequence of the monotonicity property shown in Proposition 4.1. We are left to prove the limit to +∞. Consider x increasing to ∞. By induction, for n = 1, μ_1^x diverges to +∞ as x → +∞ by condition (13). Then, Δ({α_1^x, β_1^x}, {α_2, β_2}; A_1) = a_1(μ_1^x − μ_2) goes to +∞ since μ_1^x is divergent and a_1 > 0. Assume now that the statement is true for n = m − 1, for some natural number m > 1. By (12),

Δ({α_1^x, β_1^x}, {α_2, β_2}; A_m) = (a_1 − a_2)(μ_1^x − μ_2) + E[Δ⁺({α_1^{[x,X]}, β_1^{[x,X]}}, {α_2, β_2}; A_m^1)] − E[Δ⁺({α_2^{Y}, β_2^{Y}}, {α_1^x, β_1^x}; A_m^1)].

For the first term (a_1 − a_2)(μ_1^x − μ_2) on the right hand side of the formula above there are two possible cases: a_1 − a_2 > 0 or a_1 − a_2 = 0. In the latter case the term is zero, while when a_1 − a_2 > 0 it diverges to +∞. For the second term, E[Δ⁺({α_1^{[x,X]}, β_1^{[x,X]}}, {α_2, β_2}; A_m^1)] is nondecreasing in x (by Proposition 4.1), bounded below by 0 (by definition) and divergent to +∞ (by the induction hypothesis). We can then apply the monotone convergence theorem and obtain

lim_{x→+∞} E[Δ⁺({α_1^{[x,X]}, β_1^{[x,X]}}, {α_2, β_2}; A_m^1)] = +∞.

For the third term, notice that, for all y ∈ N, Δ⁺({α_2^y, β_2^y}, {α_1^x, β_1^x}; A_m^1) converges to 0 as x diverges, and it is bounded above by Δ⁺({α_2^y, β_2^y}, {α_1^{x=0}, β_1^{x=0}}; A_m^1). By the dominated convergence theorem we have

lim_{x→+∞} E[Δ⁺({α_2^{Y}, β_2^{Y}}, {α_1^x, β_1^x}; A_m^1)] = 0.

Remark 4.4. In Proposition 4.3 condition (13) is a sufficient condition, and the discrete beta-Stacy process is defined such that condition (2) is verified. Both conditions are satisfied when their ratio diverges. This constraint does not pose restrictions, and it is satisfied, as expected, in the special cases of the simple homogeneous process and the Dirichlet process.
We finally state the following theorems, showing that there exist break-even points determining, respectively, a stay-with-a-winner and a stay-with-a-winner/switch-on-a-loser strategy. The theorems generalize Theorem 2.1 and Theorem 2.2 of Chattopadhyay (1994), proving the existence of the break-even observations in a context more general than the Dirichlet arms, at the cost of some restrictions on the choice of the parameters of the beta-Stacy process.
Theorem 4.5. For all $\alpha^1, \beta^1$ and $\alpha^2, \beta^2$ as in Proposition 4.1, for all nonincreasing discount sequences $A_n$ and $n > 1$, there exists a break-even point $b$.

Proof. From Proposition 4.1, $\Delta(\{\alpha^1_x, \beta^1_x\}, \{\alpha^2, \beta^2\}; A_n^{(1)})$ is nondecreasing in $x$, starting from a value lower than $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_n)$ and going to infinity (Proposition 4.3). This is enough to claim that there exists a break-even point $b$ which satisfies the properties in the theorem.
Theorem 4.6. For all $\alpha^1, \beta^1$ and $\alpha^2, \beta^2$ as in Proposition 4.1, for all nonincreasing discount sequences $A_n$ and $n > 1$, if condition (15) holds, there exists a break-even point $d$.

Proof. As in the proof of Theorem 4.5, there exists a point $d$ satisfying the properties in the theorem.
Remark 4.7. Sufficient condition (15) arises because the support of the base measure of the beta-Stacy process is bounded below by zero. Both Clayton and Berry (1985) and Chattopadhyay (1994) notice that, when the support is bounded, additional conditions at the boundaries are sufficient for the existence of break-even observations. In particular, the condition intuitively means that if a very bad observation from arm 1 is extracted at stage 1 ($x$ close to 0), the alternative arm 2 is preferred under the current strategy. Note that in Theorem 4.5 a condition of the kind $\Delta(\{\alpha^1_{x=0}, \beta^1_{x=0}\}, \{\alpha^2, \beta^2\}; A_n^{(1)}) \le \Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_n)$ is superfluous, that is, a condition imposing a reduction in the expected advantage of arm 1 after the observation of $x = 0$: the worst observation $x = 0$ always causes a decrease in the expected advantage. On the other hand, without condition (15) in Theorem 4.6 we cannot exclude prior values of $\alpha^i$ and $\beta^i$, $i = 1, 2$, such that $\mu^1 \gg \mu^2$, with an expected advantage that does not change sign after the observation of $x = 0$.

Framework setting
In the continuous beta-Stacy discrete-stage two-armed problem, the responses $X_k$ and $Y_k$, respectively from arm 1 and arm 2 at stage $k \in \{1, 2, \ldots, n\}$, can assume values in $\mathbb{R}_+$, and $\alpha^1$ and $\beta^1$ (and likewise $\alpha^2$ and $\beta^2$) are, respectively, a continuous measure and a positive function, both defined on $\mathbb{R}_+$. The measures $\alpha^1$ and $\alpha^2$ are assumed a priori to have no discontinuity points.
Recalling that we omit the dependence on $\alpha^1, \alpha^2, \beta^1$ and $\beta^2$, the results in Section 2.1 say that, for $t \in \mathbb{R}_+$,
$$P(X_1 > t) = \prod_{[0,t]} \left\{ 1 - \frac{d\alpha^1(s)}{\beta^1(s)} \right\},$$
where $\prod_{[0,t]}$ denotes the product integral, an operator commonly used in the survival analysis literature: for any partition $k_1 = t_0 < t_1 < \cdots < t_m = k_2$ of the interval $[k_1, k_2]$, with $k_1 < k_2$ both in $\mathbb{R}_+$,
$$\prod_{[k_1, k_2]} \{1 - dG(s)\} := \lim \prod_{j=1}^{m} \left\{ 1 - \big(G(t_j) - G(t_{j-1})\big) \right\},$$
where the limit is taken over all partitions of the interval $[k_1, k_2]$ with mesh $l_m$ approaching zero. See Gill and Johansen (1990) for a survey of applications of product integrals to survival analysis. We can compute the prior means $\mu^1$ and $\mu^2$ in analogy with the discrete case, assuming, without loss of generality, that $\mu^1 \le \mu^2$. For stage $k \in \{1, 2, \ldots, n\}$, from the results in Section 2.3, and partitioning $x_{k-1} = [x^{\text{exact}}_{k-1}, x^{\text{cens}}_{k-1}]$ into, respectively, exact and censored observations, the posterior mean is obtained.
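As a numerical aside (not from the paper): when $\alpha^1$ is continuous, the product integral reduces to an exponential, $\prod_{[0,t]}\{1 - d\alpha^1(s)/\beta^1(s)\} = \exp\{-\int_0^t d\alpha^1(s)/\beta^1(s)\}$, and this can be verified on a fine grid. The sketch below assumes the Dirichlet-type parameterization $\alpha^1(ds) = M f_0(s)\,ds$ and $\beta^1(s) = M F_0([s, \infty))$ with $F_0$ a unit-rate exponential; these choices are purely illustrative.

```python
from math import exp

def prod_integral(dalpha, beta, t, n=10_000):
    """Grid approximation of prod_{[0,t]} (1 - dalpha(s)/beta(s)):
    multiply the one-cell factors 1 - dalpha(s)*h/beta(s) over a fine partition."""
    h = t / n
    p = 1.0
    for k in range(n):
        s = (k + 0.5) * h  # midpoint of the k-th cell
        p *= 1.0 - dalpha(s) * h / beta(s)
    return p

# Illustrative Dirichlet-type choice: F0 = Exp(1), total mass M
M = 3.0
dalpha = lambda s: M * exp(-s)  # density of alpha^1
beta = lambda s: M * exp(-s)    # M * F0([s, infinity))

# Here dalpha/beta = 1, so the product integral over [0, 2] is exp(-2):
approx = prod_integral(dalpha, beta, 2.0)
# approx is close to exp(-2) ≈ 0.1353
```

The grid product converges to the exponential of minus the cumulative hazard as the mesh shrinks, mirroring the limit over partitions in the definition above.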

Monotonicity of the expected advantage
The function $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_n)$ can be expressed as in (10). In the propositions of this and the next subsection we study its monotonicity and continuity properties, with the aim of proving, in Section 5.4, the existence of break-even observations for stay-with-a-winner and stay-with-a-winner/switch-on-a-loser strategies.
Proposition 5.1. For all $\alpha^1, \beta^1$ and $\alpha^2, \beta^2$ such that $-\frac{\partial}{\partial t}\beta^1(t) \ge \frac{\partial}{\partial t}\alpha^1(t)$, $t \in \mathbb{R}_+$, and for all nonincreasing discount sequences $A_n$, the expected advantage $\Delta(\{\alpha^1_x, \beta^1_x\}, \{\alpha^2, \beta^2\}; A_n^{(1)})$ is nondecreasing in $x$.

Proof. By induction. For $n = 1$ and $x$ censored to the right, we first show that $\mu^1_x$ is nondecreasing in $x$. The integrand in $\mu^1_x$, as a function of $t$, has a discontinuity point at $t = x$, but its value at this point is ignored, since it does not contribute to the evaluation of $\mu^1_x$. Take now any $x^* > x$, and separate the cases $t \le x$, $t \in (x, x^*)$ and $t \ge x^*$:
• When $t \le x$, the integrands in $\mu^1_x$ and $\mu^1_{x^*}$ are the same.
• When $t \ge x^*$, the integrand in $\mu^1_{x^*}$ is always greater than or equal to the one in $\mu^1_x$.
• Finally, when $t \in (x, x^*)$, comparing the two integrands again gives the same ordering,
proving that $\mu^1_{x^*} \ge \mu^1_x$ and that the statement is true for $n = 1$. On the other hand, when $x$ is not censored, $\mu^1_{x^*} \ge \mu^1_x$ for all $x^* > x$ if and only if $P(X_2 = x \mid x)$, the probability that $X_2$ from arm 1 at stage 2 equals the previous exact observation, is nondecreasing in $x$. This condition is equivalent to $-\frac{\partial}{\partial t}\beta^1(t) \ge \frac{\partial}{\partial t}\alpha^1(t)$, for $t \in \mathbb{R}_+$, as required in the proposition.
By induction, assuming the monotonicity property for $n = m - 1$, for some natural number $m > 1$, the proof is completed along the lines of Proposition 4.1.

Continuity of the expected advantage
Proposition 5.3. For all $\alpha^1, \beta^1$ and $\alpha^2, \beta^2$ and all nonincreasing discount sequences $A_n$, the expected advantage $\Delta(\{\alpha^1_x, \beta^1_x\}, \{\alpha^2, \beta^2\}; A_n^{(1)})$ is a continuous function of $x$, for $x \in \mathbb{R}_+$ censored to the right.
Proof. It is enough to show that, for any increasing or decreasing sequence $\{x\}$ converging to $x_0 \in \mathbb{R}_+$, the expected advantage converges to its value at $x_0$. We provide the proof only for an increasing sequence $\{x\}$, since the decreasing case is similar. By induction, first fix $n = 1$. The continuity in $x$ is shown through the continuity of $\mu^1_x$: taking any increasing sequence converging to $x_0$, the limit can be passed inside the integral, by the continuity in $x$ of the integrand. To finally see that $\mu^1_x$ is continuous, we need to prove the continuity in $x$ of a function $H$ that is a parameterized Riemann integral whose integration extremes also depend on the parameter; $H$ is given by the composition of two continuous functions, and is therefore continuous. Assume now that the statement is true for $n = m - 1$, for some natural number $m > 1$, and consider the decomposition in (12). The first term $(a_1 - a_2)(\mu^1_x - \mu^2)$ on the right-hand side is continuous in $x$, from the continuity of $\mu^1_x$. For the second term, note that it is a nondecreasing sequence in $x$ (by Proposition 5.1), bounded below by 0 (by its definition) and convergent to $\Delta^+(\{\alpha^1_{[x_0, X_2]}, \beta^1_{[x_0, X_2]}\}, \{\alpha^2, \beta^2\}; A_m^{(1)})$ (by the induction hypothesis); we can then apply the monotone convergence theorem. For the third term, notice that, for $y \in \mathbb{R}_+$, it converges as $x$ converges (by the induction hypothesis) and is bounded above; by the dominated convergence theorem the third term converges as well, proving continuity for the generic bandit horizon $n$.

Existence of break-even points
Proposition 5.4. For all $\alpha^1, \beta^1$ and $\alpha^2, \beta^2$ as in Proposition 5.1 such that $\int_0^\infty d\alpha^1(t)/(\beta^1(t) + 1) < \infty$, for all $x \in \mathbb{R}_+$ and all nonincreasing discount sequences $A_n$, the expected advantage $\Delta(\{\alpha^1_x, \beta^1_x\}, \{\alpha^2, \beta^2\}; A_n^{(1)})$ is minimized at $x = 0$ and diverges to $+\infty$ as $x \to +\infty$.

Proof. The case $x = 0$ is an immediate consequence of Proposition 5.1. To study the case where $x$ diverges, we proceed by induction: for $n = 1$, $\mu^1_x$ diverges with $x$, and the rest of the proof follows the same lines as the proof of Proposition 4.3.
Remark 5.5. Coherently with the conditions required in Proposition 4.3 for the discrete beta-Stacy bandit problem, in the above proposition the additional condition $\int_0^\infty d\alpha^1(t)/(\beta^1(t) + 1) < \infty$ is a sufficient condition. Note that the beta-Stacy process is defined so that $\int_0^\infty d\alpha^1(t)/\beta^1(t) = \infty$. These two improper integrals should have a different asymptotic behavior, a condition that is verified when, by the limit comparison test for integrals, the limit of the ratio of the two integrands is different from 1, that is, when
$$\lim_{t \to \infty} \frac{\beta^1(t) + 1}{\beta^1(t)} \neq 1.$$
For finite $\beta^1$ this is satisfied, and, as expected, the special cases of the simple homogeneous process and the Dirichlet process are included. In short, the additional constraint rules out cases of exploding $\beta^1$. Usually, $\beta^1$ is fixed such that $\beta^1(t) = M \cdot F_0([t, \infty))$, converging to 0 as $t$ diverges (see Walker and Muliere 1997).
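The two integrals in the remark can be checked numerically. The sketch below uses the usual choice $\beta^1(t) = M \, F_0([t, \infty))$ with $F_0$ a unit-rate exponential and $\alpha^1(dt) = M f_0(t)\,dt$ (an illustrative assumption matching the Dirichlet-process special case, not a prescription of the text): $\int_0^U d\alpha^1/\beta^1$ is the cumulative hazard of $F_0$ and grows linearly in $U$, while $\int_0^\infty d\alpha^1/(\beta^1 + 1)$ is finite, with closed form $\log(M + 1)$.

```python
from math import exp, log

def integral(g, upper, n=40_000):
    """Midpoint-rule approximation of int_0^upper g(t) dt."""
    h = upper / n
    return sum(g((k + 0.5) * h) for k in range(n)) * h

M = 5.0
f0 = lambda t: exp(-t)  # density of F0 = Exp(1)
S0 = lambda t: exp(-t)  # F0([t, infinity))

# Cumulative hazard int_0^U dalpha/beta = int_0^U f0/S0 dt = U: diverges with U
H = integral(lambda t: f0(t) / S0(t), upper=40.0)

# int_0^inf dalpha/(beta + 1) = int_0^inf M f0 / (M S0 + 1) dt = log(M + 1): finite
I = integral(lambda t: M * f0(t) / (M * S0(t) + 1.0), upper=40.0)
# H ≈ 40 (growing with upper), while I ≈ log(6) ≈ 1.792
```

Increasing `upper` makes `H` grow without bound while `I` stays essentially constant, which is the different asymptotic behavior the remark requires.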
Remark 5.8. For Theorems 5.6 and 5.7, Remark 4.7 remains valid on the sufficiency of an additional boundary condition (needed in Theorem 5.7, but not in Theorem 5.6) for finding break-even points in bandit problems with base measures having bounded support.

Discrete beta-Stacy bandit examples
Consider the discrete beta-Stacy two-armed bandit problem, and let $M_i = \alpha^i(\mathbb{N})$, $i = 1, 2$, be the total masses of the measures $\alpha^1$ and $\alpha^2$. A higher value of $M_i$ is interpreted as a stronger prior knowledge of the beta-Stacy process related to arm $i$. Observations are censored to the right, the bandit horizon is fixed to $n = 3$, and the discount sequence is $A_3 = (1, 0.9, 0.8)$. Choosing higher values of $n$ is feasible, at the cost of higher processing times. Denoting by $X^{(l)}_k$ ($Y^{(l)}_k$) the $l$-th sampled extraction from arm 1 (arm 2) at stage $k$, we first sample $X^{(l)}_1$ and $Y^{(l)}_1$ from the two arms, for $l = 1, \ldots, T$ with $T = 100$. Then, for each $X^{(l)}_1$ we sample $T$ times $X_2$ from the first arm, and for each $Y^{(l)}_1$ we sample $T$ times $Y_2$ from the second arm. Sampling from prior and posterior beta-Stacy processes is done, respectively, with Algorithms A and B in Al Labadi and Zarepour (2013). See also De Blasi (2007) for an alternative way of simulating from the beta-Stacy process.
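The nested sampling scheme can be sketched as follows. The geometric draw below is only a toy stand-in for the beta-Stacy prior and posterior samplers (Algorithms A and B of Al Labadi and Zarepour, 2013), and the posterior update is deliberately omitted: the point is the nesting of the $T \times T$ stage-2 draws, not the marginal law of the draws.

```python
import random

T = 100
rng = random.Random(1)

def draw_arm(p):
    """Toy stand-in for one draw from a (prior or posterior) beta-Stacy arm:
    geometric response on {1, 2, ...} with survival P(X > t) = p**t."""
    t = 1
    while rng.random() < p:
        t += 1
    return t

# Stage 1: T draws X_1^(l) and Y_1^(l) from each arm.
x1 = [draw_arm(0.90) for _ in range(T)]
y1 = [draw_arm(0.92) for _ in range(T)]

# Stage 2: for each stage-1 draw, T draws from the same arm; in the real
# scheme the sampler would first be updated with the stage-1 observation.
x2 = [[draw_arm(0.90) for _ in range(T)] for _ in x1]
y2 = [[draw_arm(0.92) for _ in range(T)] for _ in y1]
# x2[l] collects the T stage-2 draws paired with x1[l], and similarly y2[l]
```

The resulting $T + T^2$ draws per arm are what the Monte Carlo evaluation of the expected advantage averages over.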

Discrete numerical example 1
Fix, for all $t \in \mathbb{N} \setminus \{0\}$, $\alpha^1\{t\} = 0.1 M_1 \cdot 0.9^{t-1}$ and $\beta^1(t) = M_1 \cdot 0.9^t$ for the first arm, and $\alpha^2\{t\} = 0.08 M_2 \cdot 0.92^{t-1}$ and $\beta^2(t) = M_2 \cdot 0.92^t$ for the second arm. For $t = 0$ and $i = 1, 2$, $\alpha^i\{t\} = \beta^i(t) = 0$; for all $t \notin \mathbb{N}$ and $i = 1, 2$, $\alpha^i\{t\} = 0$ and $\beta^i(t) = \beta^i(\lfloor t \rfloor)$, where $\lfloor t \rfloor$ is the largest integer lower than or equal to $t$. Note that $\mu^1 < \mu^2$ a priori, and that different values of $M_1$ and $M_2$ do not affect the prior means $\mu^1$ and $\mu^2$, but only the posterior means, since
$$\mu^1 = \sum_{t=1}^{\infty} \prod_{j=1}^{t} \frac{M_1 \cdot 0.9^j}{0.1 M_1 \cdot 0.9^{j-1} + M_1 \cdot 0.9^j} = \sum_{t=1}^{\infty} 0.9^t = 9,$$
and similar calculations show that $\mu^2 = 11.5$. The assumption of Proposition 4.1 is satisfied, since for all $t \in \mathbb{N} \setminus \{0\}$ we have $\beta^1(t) = \beta^1(t+1) + \alpha^1\{t+1\} = M_1 \cdot 0.9^t$ and $\beta^2(t) = \beta^2(t+1) + \alpha^2\{t+1\} = M_2 \cdot 0.92^t$; for $t = 0$ and $i = 1, 2$ we have $\beta^i(t) = 0 < M_i = \beta^i(t+1) + \alpha^i\{t+1\}$; and the condition also holds for $t \notin \mathbb{N}$, $i = 1, 2$. For each scenario, we evaluate $\mu^1_{x_1}$, $\mu^1_{x_2}$, $\mu^2_{y_1}$ and $\mu^2_{y_2}$; we then evaluate $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3)$, reported in Table 1 for different values of $M_1$ and $M_2$. There is a tendency for $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3)$ to increase in $M_2$ and decrease in $M_1$, holding everything else constant. This result is coherent with the exploitation-exploration trade-off mentioned in the Introduction, and suggests that the less is known about an arm, the more appealing it is to select it, since more information can be gained from its exploration: when $M_1$ increases, more weight is given to the prior belief on arm 1, which is then considered better known, with a consequent higher tendency to explore arm 2, as reflected in the lower expected advantage of arm 1 over arm 2. A specular reasoning on $M_2$ leads to a higher expected advantage of arm 1 over arm 2 as $M_2$ increases.
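The prior means above can be checked numerically via the survival-product formula $\mu^i = \sum_{t \ge 1} \prod_{j=1}^{t} \beta^i(j)/(\alpha^i\{j\} + \beta^i(j))$; a truncated sketch (the truncation point $T$ is an implementation choice, not from the text):

```python
def prior_mean(a_jump, b, T=1_000):
    """mu = sum_{t>=1} prod_{j<=t} b(j) / (a{j} + b(j)), truncated at T terms."""
    mu, surv = 0.0, 1.0
    for t in range(1, T + 1):
        surv *= b(t) / (a_jump(t) + b(t))  # prior survival at t
        mu += surv
    return mu

M1 = M2 = 1.0  # the prior means do not depend on M_1, M_2
a1 = lambda t: 0.1 * M1 * 0.9 ** (t - 1)
b1 = lambda t: M1 * 0.9 ** t
a2 = lambda t: 0.08 * M2 * 0.92 ** (t - 1)
b2 = lambda t: M2 * 0.92 ** t

mu1 = prior_mean(a1, b1)  # ≈ 9, as in the text
mu2 = prior_mean(a2, b2)  # ≈ 11.5, as in the text
```

Each one-step survival ratio simplifies to 0.9 (arm 1) and 0.92 (arm 2), so the sums are geometric; rerunning with different $M_1, M_2$ leaves `mu1` and `mu2` unchanged, confirming that the total masses affect only the posterior means.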
Furthermore, as both $M_1$ and $M_2$ increase, prior information on both arms assumes more relevance, relative to the information coming from the observation of the arms, up to the case where $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3)$ approaches $\mu^1 - \mu^2 = -2.5$ (the prior mean difference), with no impact of the observations on the choice of the arm to observe. When $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3)$ is positive the optimal arm is the first one, and vice versa when it is negative. Most of the time, the difference in the prior means makes the second arm the optimal one, except in cases with $M_1 \ll M_2$: these are situations where the higher prior uncertainty (lower $M_1$) on the first arm, relative to the higher prior confidence in the second arm (larger $M_2$), makes the first arm preferable to explore, even if a priori arm 2 is believed to be better.

Discrete numerical example 2
The beta-Stacy parameters in the two-armed bandit problem are fixed, for $i = 1, 2$ and $t \in \mathbb{N}$, as in (16) and (17), where $c_i, h_i \in \mathbb{N}$ and $h_i < c_i$. For all $t \notin \mathbb{N}$ and $i = 1, 2$, $\alpha^i\{t\} = 0$ and $\beta^i(t) = \beta^i(\lfloor t \rfloor)$. Note that, for $i = 1, 2$, the prior mean is a function of $c_i$, and therefore we can fix the prior means through $c_1$ and $c_2$. The assumption of Proposition 4.1 is satisfied, since for $i = 1, 2$ the required relations between $\alpha^i$ and $\beta^i$ hold for $t \le c_i$.

[Figure 1: Values of $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3)$ with parameters as specified in equations (16) and (17), for different variability around the base measure ($M_2$) and different base measure variability ($h_2$) of the beta-Stacy process from the second arm. The prior means are both equal to $\mu^1 = \mu^2 = 20$, and $M_1 = h_1 = 1$.]
The parameter $h_i$ is positively related to the variability of the base measure of the beta-Stacy process related to arm $i$, whilst $M_i$ is negatively related to the variability around the base measure. We fix $\mu^1 = \mu^2 = 20$ and $M_1 = h_1 = 1$, to see how the expected advantage of arm 1 over the other arm is affected by a change in $h_2$ and in $M_2$. In this way we set up an experiment in which we can isolate the effect on the expected advantage of arm 1 of a change in the prior variability of the beta-Stacy base measure related to arm 2 (a change in $h_2$) from the effect of a change in the prior belief in the base measure (a change in $M_2$). The prior means are fixed equal to avoid the results being driven by a dominant prior mean. For instance, holding $M_2$ fixed, an increase in $h_2$ leaves the prior mean of arm 2 unaltered, but the support of $\alpha^2$ is more spread out, with a consequent increase in the variability of responses from arm 2 and a more convenient exploration of arm 2.
In Figure 1 we report the value of $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3)$ for different $h_2$ and $M_2$. The dotted line, corresponding to the case $h_1 - h_2 = 0$, shows how a lower variability (higher $M_2$) around the base measure of arm 2 makes this arm less interesting to explore, in favor of arm 1. The same effect is caused by a change in the variability of the base measure of arm 2: for $h_2 - h_1 < 0$ and for $M_1 = M_2 = 1$, arm 1 is preferred, up to $\Delta(\{\alpha^1, \beta^1\}, \{\alpha^2, \beta^2\}; A_3) \approx 6$ for $h_2 - h_1 = -10$. Vice versa, higher positive values of $h_2 - h_1$ correspond to a higher preference for arm 2. Furthermore, in the considered setting the effect of a change in $M_2$ seems to dominate when it becomes very large: as we increase $M_2$, the distances among the scenarios with different $h_2$ decrease and concentrate on positive expected advantages of the first arm over the second one.
In the bottom-left plot of Figure 2 we report $\Delta(\{\alpha^1_x, \beta^1_x\}, \{\alpha^2, \beta^2\}; A_3^{(1)})$ when $x \in \mathbb{R}_+$ is right-censored, together with the expected advantage of arm 1 if the data were incorrectly supposed to be exact. In other words, we compare the beta-Stacy bandit problem with the corresponding Dirichlet bandit problem that ignores the censoring, to quantify the difference between the two and highlight the relevance of properly accounting for right-censored data. There is a range of values of $x$, from 9.12 to 16.48, over which the Dirichlet bandit problem would adopt the wrong strategy, since its $\Delta(\{\alpha^1_x, \beta^1_x\}, \{\alpha^2, \beta^2\}; A_3^{(1)})$ would be of opposite sign relative to the corresponding beta-Stacy quantity. The break-even point for the Dirichlet bandit is too low, since it judges the observations to be exact and therefore does not account for the increased chance of observing higher values in future stages. If we repeat the experiment 150 times for each value of $x$ from 1 to 30, we can compute the probability of the Dirichlet bandit being in error in the choice of the optimal arm, after the observation of $x$ from arm 1 at stage 1. This probability is reported in the bottom-right plot of Figure 2.

Conclusions and further directions
We have studied Bayesian nonparametric bandit problems with right-censored data, where two independent arms are generated by beta-Stacy processes (Walker and Muliere 1997). The proposed framework extends the one-armed and two-armed Dirichlet bandit problems of Clayton and Berry (1985) and Chattopadhyay (1994), since the beta-Stacy process reduces to the Dirichlet process for a special choice of the process parameters and in the absence of censored observations. We have shown some properties of the expected advantage of the first arm over the second arm and, under non-restrictive constraints on the process parameters, the existence of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser break-even points that partially characterize optimal strategies.

Relation to delayed responses
A stream of literature close to the proposed setting is that on bandit problems with delayed responses: new subjects arrive in the bandit problem, and optimal assignment to arms has to be performed before the responses of past subjects have been observed. Delayed bandits were first proposed by Eick (1985, 1988a) for the one-armed and two-armed bandit and by Eick (1988b) for the multi-armed bandit problem. Among later significant developments, Stout (2001, 2006) proposed the two-armed Bernoulli bandit problem with subjects arriving according to a Poisson process and exponential responses; Wang (2000) studied boundary and monotonicity properties of the break-even point and finite-stage optimal stopping of the Eick (1988a) bandit problem; the one-armed Eick (1988a) bandit with continuous stages was introduced by Wang and Bickis (2003); Caro and Yoo (2010) proved that discrete-stage bandit problems with stationary random delays satisfy the indexability criterion (Whittle 1988) as long as the delayed responses do not cross over; a computationally efficient approximation to the solution of a multi-armed bandit problem with delayed responses was implemented in Guha, Munagala and Pál (2013).
The relation between censored and delayed responses lies in the fact that past responses not yet observed in delayed bandits are typically treated as censored observations. But there is a significant difference that prevents the approach of the present paper from fitting into the framework of delayed responses: in delayed bandits the censored observations are potentially exact observations whose exact value has not yet been realized, whilst in our setting a censored observation will never become exact. In the patients' treatment example, the censored observation in the delayed bandit is the survival time returned at the current stage by a treated patient who has not yet died, and this piece of data will become exact at the stage at which the patient dies. On the contrary, we are not able to treat delayed responses, since the time index of the beta-Stacy process driving the generation of the response is detached from the stage index of the bandit problem: the new bandit stage begins (the new subject arrives) only when past responses (exact or censored) are given. The extension of our setting to the case with delayed responses is an interesting research question beyond the scope of the present paper, but similarities between the two approaches give first suggestions on how this extension could be performed: from Walker and Muliere (1997), the beta-Stacy process can be expressed in a product form that resembles and extends the geometric distribution of the delayed responses in Eick (1985, 1988a).

Multi and one-armed bandits
The extension to multi-armed contextual bandits (Langford and Zhang 2008) can be implemented by introducing dependence of the arm parameters on external regressors, or by introducing dependence between Bayesian nonparametric arms through partial exchangeability (de Finetti 1938, 1959), for instance with the mixture of Dirichlet processes of Antoniak (1974), the bivariate Dirichlet process of Walker and Muliere (2003) or the bivariate beta-Stacy process of Muliere, Bulla and Walker (2007). In this direction, Battiston, Favaro and Teh (2016) adopt hierarchical Poisson-Dirichlet processes in multi-armed bandit problems. The introduction of dependence between the arms also suggests the extension to restless bandit problems, in which the parameters of the beta-Stacy process of one arm are updated even if no response from that arm is observed, as a consequence of the observation of a response from a dependent arm.
We highlight that, in the setting of the current paper, the parameters associated with the beta-Stacy process of arm $i$, $i = 1, 2$, are updated according to the rules given in Section 2.3 only when a new response is observed from arm $i$, ruling out restless bandits (Whittle 1988) from the present setting. This feature, together with the independence of the two arms, makes the current setting (and the one with a generic number of arms) a classic bandit problem that, in the special case of a discount sequence that is geometric at least up to a constant of proportionality and common to all arms, is solvable in principle by a Gittins index policy (Gittins 1979). The feasibility of the Gittins index calculation associated with each beta-Stacy arm is not trivial and is the object of current investigation by the authors, with the aim of generalizing the proposed setting to a multi-armed framework. We conjecture that one of the methods surveyed in Chakravorty and Mahajan (2014) could be adopted, but an effort is needed to derive the transition probability matrix corresponding to the Markovian update of the bandit process.
On the other hand, the present framework reduces to a one-armed beta-Stacy bandit problem if we let the total mass $M_2 := \alpha^2(\mathbb{R}_+)$ diverge to $+\infty$. The value of $M_2$ indicates the strength of the prior belief in the base measure of the beta-Stacy process associated with arm 2, and the extreme case of an infinite $M_2$ means de facto a sure knowledge of the distribution driving the generation of the responses from arm 2. The result would be an arm 1 whose responses are driven by a beta-Stacy process whose parameters are updated according to the specified rules as new observations from arm 1 are collected, and an arm 2 with known distribution equal to the mean distribution of the beta-Stacy process of arm 2, whose parameters are never updated, regardless of the responses observed from arm 2. All subsequent results would remain the same, with the exception that the mean response of arm 2, affecting the expected bandit payoff, would not change from stage to stage but would remain fixed at its prior value.

Other directions of investigation
Our framework can be further extended to different bandit problems. First, the common formulation of the Bernoulli bandit can be replicated through the choice of Bernoulli base measures, centered on success probabilities that are learnt as observations are collected. Second, semi-uniform strategies with greedy behaviour can be addressed: epsilon-greedy and epsilon-first strategies (Watkins 1989; Sutton and Barto 1998), which dedicate a proportion of stages to, respectively, random and purely exploratory phases, can be derived by randomizing the reinforcement learning mechanism of the arms' parameters (Muliere, Paganoni and Secchi 2006); epsilon-decreasing and VDBE strategies (Cesa-Bianchi and Fischer 1998; Tokic 2010) would require a beta-Stacy parameter update mechanism dependent on the number of stages or on the values extracted from the arms. Third, the sequential nature of the Bayesian framework and the flexibility of nonparametric priors make it possible to handle more general cases of non-stationary bandit problems (Garivier and Moulines 2008), where the underlying base measure of the beta-Stacy processes can change after some stage.