A Quasi-Bayesian Perspective to Online Clustering

When faced with high-frequency streams of data, clustering raises theoretical and algorithmic challenges. We introduce a new and adaptive online clustering algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e., time-dependent) estimation of the (unknown and changing) number of clusters. We prove that our approach is supported by minimax regret bounds. We also provide an RJMCMC-flavored implementation (called PACBO, see https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a convergence guarantee. Finally, numerical experiments illustrate the potential of our procedure.


Introduction
Online learning has been extensively studied in the last decades in game theory and statistics (see Cesa-Bianchi and Lugosi, 2006, and references therein). The problem can be described as a sequential game: at each time t, a blackbox reveals some z_t ∈ Z. The forecaster then predicts the next value based on past observations and possibly other available information. Unlike the classical statistical framework, the sequence (z_t) is not assumed to be a realization of some stochastic process. Research efforts in online learning began in the framework of prediction with expert advice. In this setting, the forecaster has access to a set {f_{e,t} ∈ D : e ∈ E} of experts' predictions, where E is a finite set of experts (such as deterministic physical models, or stochastic decisions). Predictions made by the forecaster and the experts are assessed with a loss function ℓ : D × Z → R_+. The goal is to build a sequence ẑ_1, . . . , ẑ_T (denoted by (ẑ_t)_{1:T} in the sequel) of predictions which are nearly as good as the best expert's predictions in the first T time rounds, i.e., satisfying, uniformly over any sequence (z_t), the regret bound

Σ_{t=1}^T ℓ(ẑ_t, z_t) ≤ inf_{e∈E} Σ_{t=1}^T ℓ(f_{e,t}, z_t) + ∆_T(E),

where ∆_T(E) is a remainder term. This term should be as small as possible and in particular sublinear in T. When E is finite, and the loss ℓ is bounded in [0, 1] and convex in its first argument, an explicit ∆_T(E) = √((T/2) log |E|) is given by Cesa-Bianchi and Lugosi (2006). The optimal forecaster is then obtained by forming the exponentially weighted average of all experts. For similar results, we refer the reader to Littlestone and Warmuth (1994) and Cesa-Bianchi et al. (1997).
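The exponentially weighted average forecaster mentioned above can be sketched in a few lines. The following is our own illustrative implementation (not from the paper), with the learning rate eta left as a free parameter; for losses bounded in [0, 1], a rate of order √(8 log|E|/T) recovers a remainder of the order stated above.

```python
import math

def ewa_forecaster(expert_preds, outcomes, loss, eta):
    """Exponentially weighted average forecaster.

    expert_preds: list of T lists; expert_preds[t][e] is expert e's prediction at time t
    outcomes:     list of T observed values z_t
    loss:         loss function (prediction, outcome) -> [0, 1]
    eta:          learning-rate parameter > 0
    Returns the aggregated predictions and the experts' cumulative losses.
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts
    forecasts = []
    for t, z in enumerate(outcomes):
        # weight each expert by the exponential of its past cumulative loss
        weights = [math.exp(-eta * L) for L in cum_loss]
        total = sum(weights)
        # convex combination of the experts' current predictions
        pred = sum(w / total * p for w, p in zip(weights, expert_preds[t]))
        forecasts.append(pred)
        for e in range(n_experts):
            cum_loss[e] += loss(expert_preds[t][e], z)
    return forecasts, cum_loss
```

With two constant experts and a bounded squared loss, the forecast quickly concentrates on the better expert.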
Online learning techniques have also been applied to the regression framework. In particular, sequential ridge regression has been studied by Vovk (2001). For any t = 1, . . . , T, we now assume that z_t = (x_t, y_t) ∈ R^d × R. At each time t, the forecaster gives a prediction ŷ_t of y_t, using only the newly revealed side information x_t and past observations (x_s, y_s)_{1:(t−1)}. Let ⟨·, ·⟩ denote the scalar product in R^d. A possible goal is to build a forecaster whose performance is nearly as good as that of the best linear forecaster f_θ : x ↦ ⟨θ, x⟩, i.e., such that, uniformly over all sequences (x_t, y_t)_{1:T},

Σ_{t=1}^T (ŷ_t − y_t)² ≤ inf_{θ∈R^d} Σ_{t=1}^T (⟨θ, x_t⟩ − y_t)² + ∆_T(d),

where ∆_T(d) is a remainder term. This setting has been addressed by numerous contributions to the literature. In particular, Azoury and Warmuth (2001) and Vovk (2001) each provide an algorithm close to ridge regression with a remainder term ∆_T(d) = O(d log T).
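As an illustration, here is a minimal sequential ridge sketch of our own (not the exact Azoury-Warmuth or Vovk forecaster, which additionally folds the current x_t into the regularized Gram matrix before predicting):

```python
import numpy as np

def online_ridge(X, y, alpha=1.0):
    """Sequential ridge regression: before observing y_t, predict with the
    ridge estimator fitted on the past (x_s, y_s), s < t.

    X: (T, d) array; y: (T,) array. Returns the T predictions.
    Plain refit via rank-one updates of the regularized Gram matrix.
    """
    T, d = X.shape
    A = alpha * np.eye(d)   # regularized Gram matrix
    b = np.zeros(d)
    preds = np.empty(T)
    for t in range(T):
        x = X[t]
        theta = np.linalg.solve(A, b)   # ridge estimate from past data only
        preds[t] = x @ theta
        A += np.outer(x, x)             # rank-one update with the new covariate
        b += y[t] * x
    return preds
```

On data generated by a fixed linear model, the predictions approach the best linear forecaster as t grows.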
The purpose of the present work is to generalize the aforementioned framework to the clustering problem, which has attracted attention from both the machine learning and streaming communities. As an example, Guha et al. (2003), Barbakh and Fyfe (2008) and Liberty et al. (2016) study the so-called data streaming clustering problem. It amounts to clustering online data into a fixed number of groups in a single pass, or a small number of passes, while using little memory. From a machine learning perspective, Choromanska and Monteleoni (2012) aggregate online clustering algorithms with a fixed number K of centers. The present paper investigates a more general setting, since we aim to perform online clustering with an unfixed and changing number K_t of centers. To the best of our knowledge, this is the first attempt of the sort in the literature. Let us stress that our approach only requires an upper bound p on K_t, which can be either a constant or an increasing function of the time horizon T.
Our approach strongly relies on a quasi-Bayesian methodology. The use of quasi-Bayesian estimators is especially advocated by the PAC-Bayesian theory, which originates in the machine learning community in the late 1990s, in the seminal works of Shawe-Taylor and Williamson (1997) and McAllester (1999a,b) (see also Seeger, 2002, 2003). In the statistical learning community, the PAC-Bayesian approach has been extensively developed by Catoni (2004, 2007), Audibert (2004) and Alquier (2006), and later on adapted to the high-dimensional setting by Dalalyan and Tsybakov (2007, 2008), Alquier and Lounici (2011), Alquier and Biau (2013), Guedj and Alquier (2013), Guedj and Robbiano (2015) and Alquier and Guedj (2017). In a parallel effort, the online learning community has contributed to the PAC-Bayesian theory in the online regression setting (Kivinen and Warmuth, 1999). Audibert (2009) and Gerchinovitz (2011) have been the first attempts to merge both lines of research. Note that our approach is quasi-Bayesian rather than PAC-Bayesian, since we derive regret bounds (on quasi-Bayesian predictors) instead of PAC oracle inequalities.
Our main contribution is to generalize algorithms suited to supervised learning to the unsupervised setting. Our online clustering algorithm is adaptive, in the sense that it does not require the knowledge of the time horizon T to be used and studied. The regret bounds that we obtain have a remainder term of magnitude √(T log T), and we prove that they are asymptotically minimax optimal.
The quasi-posterior which we derive is a complex distribution, and direct sampling from it is not available. In Bayesian and quasi-Bayesian frameworks, the use of Markov Chain Monte Carlo (MCMC) algorithms is a popular way to compute estimates from posterior or quasi-posterior distributions. We refer to the comprehensive monograph of Robert and Casella (2004) for an introduction to MCMC methods. For its ability to cope with transdimensional moves, we focus on the Reversible Jump MCMC (RJMCMC) algorithm of Green (1995), coupled with ideas from the Subspace Carlin and Chib algorithm proposed by Dellaportas et al. (2002) and Petralias and Dellaportas (2013). MCMC procedures for quasi-Bayesian predictors were first considered by Catoni (2004) and Dalalyan and Tsybakov (2012). Alquier and Biau (2013), Guedj and Alquier (2013) and Guedj and Robbiano (2015) were the first to investigate the RJMCMC and Subspace Carlin and Chib techniques, and we show in the present paper that this scheme is well suited to the clustering problem.
The paper is organised as follows. Section 2 introduces our notation and our online clustering procedure. Section 3 contains our mathematical claims, consisting in regret bounds for our online clustering algorithm. Remainder terms which are sublinear in T are obtained for a model selection-flavored prior. We also prove that these remainder terms are minimax optimal. We then discuss in Section 4 the practical implementation of our method, which relies on an adaptation of the RJMCMC algorithm to our setting. In particular, we prove its convergence towards the target quasi-posterior. The performance of the resulting algorithm, called PACBO, is evaluated on synthetic data. For the sake of clarity, proofs are postponed to Section 5. Finally, Appendix A contains an extension of our work to the case of a multivariate Student prior along with additional numerical experiments.

A quasi-Bayesian perspective to online clustering
Let (x_t)_{1:T} be a sequence of data, where x_t ∈ R^d. Our goal is to learn a time-dependent parameter K_t and a partition of the observed points into K_t cells, for any t = 1, . . . , T. To this aim, the output of our algorithm at time t is a vector ĉ_t = (ĉ_{t,1}, ĉ_{t,2}, . . . , ĉ_{t,K_t}) of K_t centers in R^{dK_t}, depending on the past information (x_s)_{1:(t−1)} and (ĉ_s)_{1:(t−1)}. A partition is then created by assigning any point in R^d to its closest center. When x_t is newly revealed, the instantaneous loss is computed as

ℓ(ĉ_t, x_t) = min_{j=1,...,K_t} |x_t − ĉ_{t,j}|_2²,    (2)

where |·|_2 is the ℓ_2-norm in R^d. In what follows, we investigate regret bounds for cumulative losses. Given a measurable space Θ (embedded with its Borel σ-algebra), we let P(Θ) denote the set of probability distributions on Θ, and for some reference measure ν, we let P_ν(Θ) be the set of probability distributions absolutely continuous with respect to ν. For any probability distributions ρ, π ∈ P(Θ), the Kullback-Leibler divergence K(ρ, π) is defined as

K(ρ, π) = ∫_Θ log(dρ/dπ) dρ if ρ ∈ P_π(Θ), and +∞ otherwise.

Note that for any bounded measurable function h : Θ → R and any probability distribution π ∈ P(Θ),

− log ∫_Θ exp(−h) dπ = inf_{ρ∈P_π(Θ)} { ∫_Θ h dρ + K(ρ, π) }.    (3)

This result, which may be found in Csiszár (1975) and Catoni (2004, Equation 5.2.1), is critical to our scheme of proofs. Further, the infimum is achieved at the so-called Gibbs quasi-posterior ρ̂, defined by

dρ̂ = exp(−h) dπ / ∫_Θ exp(−h) dπ.

We now introduce the notation of our online clustering setting. Let C = ∪_{k=1}^p R^{dk} for some integer p ≥ 1. We denote by q a discrete probability distribution on the set ⟦1, p⟧ := {1, . . . , p}. For any k ∈ ⟦1, p⟧, let π_k denote a probability distribution on R^{dk}. For any vector of cluster centers c ∈ C, we define π(c) as

π(c) = Σ_{k=1}^p q(k) π_k(c) 1{c ∈ R^{dk}}.    (4)

Note that (4) may be seen as a distribution over the set of partitions of R^d: any c ∈ C corresponds to a partition of R^d with at most p cells. In the sequel, we denote by c ∈ C a partition of R^d and by π ∈ P(C) a prior over this set. Let λ > 0 be some (inverse temperature) parameter.
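The instantaneous loss in (2) — the squared ℓ_2-distance from the new point to its closest proposed center — can be computed directly; a minimal sketch:

```python
import numpy as np

def clustering_loss(centers, x):
    """Instantaneous loss l(c, x) = min_j |x - c_j|_2^2: squared l2-distance
    from the point x to the closest of the proposed cluster centers."""
    centers = np.atleast_2d(centers)   # shape (k, d)
    return float(np.min(np.sum((centers - np.asarray(x)) ** 2, axis=1)))
```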
At each time t, we observe x_t, and a random partition ĉ_{t+1} ∈ C is sampled from the Gibbs quasi-posterior

ρ̂_{t+1}(dc) ∝ exp(−λ S_t(c)) π(dc),

where

S_t(c) = S_{t−1}(c) + ℓ(c, x_t) + (λ/2)(ℓ(ĉ_t, x_t) − ℓ(c, x_t))², with S_0 ≡ 0.

This quasi-posterior distribution allows us to sample partitions with respect to the prior π defined in (4), bent to fit past observations through the cumulative loss S_t, where the latter term is introduced in order to cope with the non-convexity of the loss and is essential to make the online variance inequality hold true (as discussed in Audibert, 2009, Section 4.2). S_t(c) consists in the cumulative loss of c in the first t rounds and a term that controls the variance of the next prediction. Note that since (x_t)_{1:T} is deterministic, no likelihood is attached to our approach, hence the terms "quasi-posterior" for ρ̂_{t+1} and "quasi-Bayesian" for our global approach. The resulting estimate is a realization of ρ̂_{t+1} with a random number K_t of cells. This scheme is described in Algorithm 1. Note that this algorithm is an instantiation of Audibert's online SeqRand algorithm (Audibert, 2009, Section 4) to the special case of the loss defined in (2). However, SeqRand does not account for adaptive rates λ = λ_t, as discussed in the next section.
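To fix ideas, when the continuum C is replaced by a finite grid of candidate center vectors, the Gibbs quasi-posterior can be computed exactly. The toy sketch below (our own illustration, dropping the variance term of S_t for readability) simply reweights the prior by the exponentiated cumulative loss; sampling over the actual union of spaces R^{dk} is what requires the RJMCMC machinery of Section 4.

```python
import math

def gibbs_weights(candidates, data, lam, prior):
    """Gibbs quasi-posterior restricted to a finite set of candidate center
    vectors: weight(c) proportional to prior(c) * exp(-lam * cumulative loss).
    Each candidate is a list of centers; each center and data point is a tuple."""
    def cum_loss(c):
        return sum(min(sum((xi - cj) ** 2 for xi, cj in zip(x, center))
                       for center in c) for x in data)
    w = [prior[i] * math.exp(-lam * cum_loss(c)) for i, c in enumerate(candidates)]
    total = sum(w)
    return [wi / total for wi in w]
```

With data concentrated near the origin, a one-center candidate at the origin receives almost all of the quasi-posterior mass.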

[Algorithm 1 (quasi-Bayesian online clustering): for each t = 1, . . . , T, sample a partition ĉ_t from the Gibbs quasi-posterior ρ̂_t, get the data x_t, and update the quasi-posterior; end for.]

Minimax regret bounds
Let E_{c∼ν} stand for the expectation with respect to the distribution ν of c (abbreviated as E_ν where no confusion is possible). We start with the following pivotal result.
Proposition 1 For any sequence (x_t)_{1:T} ∈ R^{dT}, for any prior distribution π ∈ P(C) and any λ > 0, the procedure described in Algorithm 1 satisfies

Σ_{t=1}^T E_{ĉ_t∼ρ̂_t} ℓ(ĉ_t, x_t) ≤ inf_{ρ∈P_π(C)} { E_{c∼ρ} Σ_{t=1}^T [ ℓ(c, x_t) + (λ/2) E_{ĉ_t∼ρ̂_t} (ℓ(ĉ_t, x_t) − ℓ(c, x_t))² ] + K(ρ, π)/λ }.

Proposition 1 is a straightforward consequence of Audibert (2009, Theorem 4.6), applied to the loss function defined in (2), the set of partitions C, and any prior π ∈ P(C).

Preliminary regret bounds
In the following, we instantiate the regret bound introduced in Proposition 1. The distribution q in (4) is chosen as the following discrete distribution on the set ⟦1, p⟧:

q(k) = exp(−ηk) / Σ_{j=1}^p exp(−ηj), k ∈ ⟦1, p⟧, with η ≥ 0.    (6)

When η > 0, the larger the number of cells k, the smaller the probability mass. Further, π_k in (4) is chosen as a product of k independent uniform distributions on ℓ_2-balls in R^d:

π_k(c) = Π_{j=1}^k [Γ(d/2 + 1)/(π^{d/2} R^d)] 1{c_j ∈ B_d(0, R)},    (7)

where R > 0, Γ is the Gamma function and

B_d(0, r) = {x ∈ R^d : |x|_2 ≤ r}

is an ℓ_2-ball in R^d, centered in 0 ∈ R^d with radius r > 0. Finally, for any k ∈ ⟦1, p⟧ and any R > 0, we denote by C(k, R) = {c ∈ R^{dk} : c_j ∈ B_d(0, R), j = 1, . . . , k} the set of partitions with k centers in B_d(0, R).

Corollary 1 For any sequence (x_t)_{1:T} ∈ R^{dT} and any p ≥ 1, consider π defined by (4), (6) and (7) with η ≥ 0 and R ≥ max_{t=1,...,T} |x_t|_2. If λ ≥ (d + 2)/(2TR²), the procedure described in Algorithm 1 satisfies a regret bound in which the oracle term inf_{c∈C(k,R)} Σ_{t=1}^T ℓ(c, x_t) is balanced against a penalty growing with k. Note that inf_{c∈C(k,R)} Σ_{t=1}^T ℓ(c, x_t) is a non-increasing function of the number k of cells, while the penalty is linearly increasing with k. Small values of λ (or equivalently, large values of R) lead to small values of k. The additional term induced by the complexity of C = ∪_{k=1,...,p} R^{dk} is log p. The calibration λ = (d + 2)/(2√T R²) yields a sublinear remainder term in the following corollary.
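Sampling a partition from the prior π given by (4), (6) and (7) is straightforward. The sketch below is our own illustration, assuming the geometric-type weights q(k) ∝ exp(−ηk) described above, and using the standard trick for uniform draws in an ℓ_2-ball.

```python
import numpy as np

def sample_prior(p, d, R, eta, rng=None):
    """Draw a vector of cluster centers from the prior: first draw k with
    probability proportional to exp(-eta * k), then draw k centers i.i.d.
    uniformly in the l2-ball B_d(0, R)."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.exp(-eta * np.arange(1, p + 1))
    k = int(rng.choice(np.arange(1, p + 1), p=probs / probs.sum()))
    # uniform draw in an l2-ball: direction = normalized Gaussian,
    # radius = R * U^(1/d) for U uniform on [0, 1]
    g = rng.standard_normal((k, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = R * rng.uniform(size=(k, 1)) ** (1.0 / d)
    return directions * radii   # shape (k, d)
```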
Corollary 2 Under the previous notation, with λ = (d + 2)/(2√T R²) and R ≥ max_{t=1,...,T} |x_t|_2, the procedure described in Algorithm 1 satisfies a regret bound whose remainder term is sublinear in T. Let us assume that the sequence x_1, . . . , x_T is generated from a distribution with k ∈ ⟦1, p⟧ clusters. We then define the expected cumulative loss (ECL) and the oracle cumulative loss (OCL) as

ECL = E Σ_{t=1}^T ℓ(ĉ_t, X_t) and OCL = E inf_{c∈C(k,R)} Σ_{t=1}^T ℓ(c, X_t).

Then Corollary 2 yields

ECL ≤ OCL + J k√(T log(4T)),    (10)

where J is a constant depending on d, R and log p. In (10), the regret of our randomized procedure, defined as the difference between ECL and OCL, is sublinear in T. However, whenever k > p, the term k√(T log(4T)) emerges and the bound in Corollary 2 is deteriorated.
Finally, note that the dependency in k in the right-hand side of (9) may be improved by choosing η growing with T and assuming p = O(log² T). This allows to achieve the optimal dependency √k instead of k in (9) and (10). However, the assumption p = O(log² T) may appear unrealistic in the online clustering setting, as p may grow with T at a faster rate than log² T. The linear dependency in k in (10) is the price to pay for a general framework.

Adaptive regret bounds
The time horizon T is usually unknown, prompting us to choose a time-dependent inverse temperature parameter λ = λ t . We thus propose a generalization of Algorithm 1, described in Algorithm 2.

[Algorithm 2 (adaptive quasi-Bayesian online clustering): same as Algorithm 1, with the constant λ replaced at each round t by a time-dependent value λ_{t−1}; end for.]
This adaptive algorithm is supported by the following more involved regret bound.
Theorem 1 For any sequence (x_t)_{1:T} ∈ R^{dT} and any prior distribution π on C, if (λ_t)_{0:T} is a non-increasing sequence of positive numbers, then the procedure described in Algorithm 2 satisfies a regret bound analogous to that of Proposition 1. If λ is chosen in Proposition 1 as λ = λ_T, the only difference between Proposition 1 and Theorem 1 lies in the last term of the regret bound. This term is larger in the adaptive setting than in the simpler non-adaptive setting, since (λ_t)_{0:T} is non-increasing. In other words, this is the price to pay for the adaptivity of our algorithm. However, a suitable choice of λ_t allows, again, for a refined result.
Corollary 3 For any deterministic sequence (x_t)_{1:T} ∈ R^{dT}, if q and π_k in (4) are taken respectively as in (6) and (7) with η ≥ 0 and R ≥ max_{t=1,...,T} |x_t|_2, and if λ_t = (d + 2)/(2√t R²) for any t ∈ ⟦1, T⟧ and λ_0 = 1, then the procedure described in Algorithm 2 satisfies a regret bound of the same flavor as in Corollary 2. Therefore, the price to pay for not knowing the time horizon T (which is a much more realistic assumption for online learning) is a multiplicative factor 2 in front of the term 81(d + 2)R²√T/4. This does not degrade the rate of convergence √(T log T).

Minimax regret
This section is devoted to the study of the minimax optimality of our approach. The regret bound in Corollary 3 has a rate √(T log T), which is not a surprising result. Indeed, many online learning problems give rise to similar bounds, depending also on the properties of the loss function. However, in the online clustering setting, it is legitimate to wonder whether the upper bound is tight, and more generally whether there exist other algorithms which provide smaller regrets. The sequel answers both questions in a minimax sense.
Let us first denote by |c| the number of cells for a partition c ∈ C . We also introduce the following assumption.
Assumption H(s): Let R > 0 and T ∈ N*. There exists an index s ∈ ⟦1, p⟧ such that |c_{T,R}| = s, where

c_{T,R} ∈ arg inf_{c∈C(p,R)} Σ_{t=1}^T ℓ(c, x_t).

Note that several partitions may achieve this infimum. In that case, we adopt the convention that c_{T,R} is any such partition with the smallest number of cells. Assumption H(s) means that (x_t)_{1:T} can be well summarized by s cells, since the infimum is reached for the partition c_{T,R}. We introduce the set of such sequences. For Algorithm 2, we have from Corollary 3 an upper bound on the regret. Then, for any s ∈ N* and R > 0, our goal is to obtain a lower bound of the form (11), where the first infimum is taken over all distributions (ρ̂_t)_{1:T} whose support is ∪_{k=1}^p R^{dk}, the X_t, t = 1, . . . , T, are i.i.d. with distribution μ defined on Ω_{s,R}, and μ^T stands for the joint distribution of (X_1, . . . , X_T). Unfortunately, in (11), since the infimum is taken over any distribution (ρ̂_t), there is no restriction on the number of cells of each partition ĉ_t. The left-hand side of (11) could then be arbitrarily small, or even negative, and the lower bound would not match the upper bound of Corollary 3. To handle this, we need to add to the loss function ℓ a penalty term which accounts for the number of cells of each partition. The upcoming theorem provides minimax results for an augmented value V_T(s), defined in (12) by adding a term which penalizes the number of cells of each partition. To capture the asymptotic behavior of V_T(s), we derive an upper bound for the penalized loss in (12). This is done in the following theorem, which combines an upper and a lower bound for the regret, hence proving that it is minimax optimal.
Theorem 2 Let s ∈ N* and R > 0 satisfy condition (13), where ⌊x⌋ represents the largest integer smaller than x. If T satisfies T^{(d+2)/2} ≥ 8R^{2d} log T, then the lower bound on V_T(s)/T is asymptotically of order √(log T/T). Note that Bartlett et al. (1998) obtained the less satisfying rate 1/√T, however holding with no restriction on the number of cells retained in the partition, whereas our claim has to comply with (13). This is the price to pay for our additional log T factor. Note however that this price is mild, as s is no longer upper bounded whenever T or R grow to +∞, casting our procedure onto the online setting, where the time horizon is not assumed finite and the number of clusters is evolving along time.
As a conclusion to the theoretical part of the manuscript, let us summarize our results. Regret bounds for Algorithm 1 are produced for our specific choice of prior π (Corollary 1) and with an involved choice of λ (Corollary 2). For the adaptive version Algorithm 2, the pivotal result is Theorem 1, which is instantiated for our prior in Corollary 3. Finally, the lower bound is stated in Theorem 2, proving that our regret bounds are minimax whenever the number of cells retained in the partition satisfies (13). We now move to the implementation of our approach.

The PACBO algorithm
Since direct sampling from the Gibbs quasi-posterior is usually not possible, we focus in this section on a stochastic approximation, called PACBO (available in the eponymous companion R package, see Li, 2016). Both the implementation and the convergence (towards the Gibbs quasi-posterior) of this scheme are discussed. This section also includes a short numerical experiment on synthetic data, to illustrate the potential of PACBO compared to other popular clustering methods.

Structure and links with RJMCMC
In Algorithm 1 and Algorithm 2, it is required to sample at each time t from the Gibbs quasi-posterior ρ̂_t. Since ρ̂_t is defined on the massive and complex-structured space C (let us recall that C is a union of heterogeneous spaces), direct sampling from ρ̂_t is not an option, and is much rather an algorithmic challenge. Our approach consists in approximating ρ̂_t through MCMC, under the constraint of favouring local moves of the Markov chain. To do so, we resort to Reversible Jump MCMC (Green, 1995), adapted with ideas from the Subspace Carlin and Chib algorithm proposed by Dellaportas et al. (2002) and Petralias and Dellaportas (2013). Since sampling from ρ̂_t is similar for any t = 1, . . . , T, the time index t is now omitted for the sake of brevity.
Let (k^(n), c^(n))_{0≤n≤N}, N ≥ 1, be the states of the Markov chain of interest of length N, where k^(n) ∈ ⟦1, p⟧ and c^(n) ∈ R^{dk^(n)}. At each RJMCMC iteration, only local moves are possible from the current state (k^(n), c^(n)) to a proposal state (k′, c′), in the sense that the proposal state should differ from the current state by at most one covariate. Hence, c^(n) ∈ R^{dk^(n)} and c′ ∈ R^{dk′} may lie in different spaces (k′ ≠ k^(n)). Two auxiliary vectors v_1 ∈ R^{d_1} and v_2 ∈ R^{d_2}, with d_1, d_2 ≥ 1, are needed to compensate for this dimensional difference, i.e., satisfying the dimension matching condition introduced by Green (1995), such that the pairs (v_1, c^(n)) and (v_2, c′) are of analogous dimension. This condition is a preliminary to the detailed balance condition, which ensures that the Gibbs quasi-posterior ρ̂_t is the invariant distribution of the Markov chain. The structure of PACBO is presented in Figure 1.

[Figure 1: General structure of PACBO.]
The proposal density ρ_k, defined in (15), is a multivariate Student distribution centered at the current cluster centers with scale parameter τ_k, where C_{τ_k}^{−1} denotes a normalizing constant. Let us now detail the proposal mechanism. First, a local move from k^(n) to k′ is proposed by choosing k′ ∈ ⟦k^(n) − 1, k^(n) + 1⟧ with probability q(k^(n), ·). Next, choosing d_1 = dk′ and d_2 = dk^(n), we sample v_1 from ρ_{k′} in (15). Finally, the pair (v_2, c′) is obtained as

(v_2, c′) = g(v_1, c^(n)),

where g : (x, y) ∈ R^{dk′} × R^{dk^(n)} ↦ (y, x) ∈ R^{dk^(n)} × R^{dk′} is a one-to-one mapping with continuous first-order derivatives. The resulting RJMCMC acceptance probability is

min{1, [ρ̂_t(c′) q(k′, k^(n)) ρ_{k^(n)}(v_2)] / [ρ̂_t(c^(n)) q(k^(n), k′) ρ_{k′}(v_1)]},

since the determinant of the Jacobian matrix of g is 1. The resulting PACBO algorithm is described in Algorithm 3.
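Because g merely swaps its arguments and has unit Jacobian, one RJMCMC iteration boils down to drawing k′, drawing proposed centers, and applying a Metropolis-Hastings-type ratio. The sketch below is our own simplified rendering, with a state-independent proposal density standing in for the Student proposal of (15), and no upper bound p on k for brevity.

```python
import math, random

def rjmcmc_step(state, log_target, proposal_q, sample_centers, proposal_density):
    """One local move of a reversible-jump step with unit Jacobian.

    state = (k, c); log_target(k, c) = log of the (unnormalized) Gibbs
    quasi-posterior; proposal_q(k, k2) = probability of proposing k2 from k;
    sample_centers(k2) draws proposed centers c2, whose density is
    proposal_density(k2, c2)."""
    k, c = state
    k2 = random.choice([k - 1, k, k + 1])   # local move: within or between model
    if k2 < 1:
        return state                        # invalid dimension, stay put
    c2 = sample_centers(k2)
    log_alpha = (log_target(k2, c2) + math.log(proposal_density(k, c))
                 + math.log(proposal_q(k2, k))
                 - log_target(k, c) - math.log(proposal_density(k2, c2))
                 - math.log(proposal_q(k, k2)))
    if math.log(random.random()) < min(0.0, log_alpha):
        return (k2, c2)                     # move accepted
    return state                            # move rejected
```

With a flat target and symmetric proposals the move is always accepted, which makes the bookkeeping easy to check.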

Convergence of PACBO towards the Gibbs quasi-posterior
We prove that Algorithm 3 builds a Markov chain whose invariant distribution is precisely the Gibbs quasi-posterior, as N goes to +∞. To do so, we need to prove that the chain is ρ̂_t-irreducible, aperiodic and Harris recurrent, see Robert and Casella (2004, Theorem 6.51) and Roberts and Rosenthal (2006, Theorem 20).

[Algorithm 3 (PACBO), core of the RJMCMC loop: for n = 0, . . . , N − 1, propose k′ with probability q(k^(n), ·), sample v_1 from ρ_{k′}, let (v_2, c′) = g(v_1, c^(n)), accept the move (k^(n+1), c^(n+1)) = (k′, c′) with the acceptance probability given above, else set (k^(n+1), c^(n+1)) = (k^(n), c^(n)); at the end of the run, let ĉ_t = c^(N).]

Recall that at each RJMCMC iteration in Algorithm 3, the chain is said to propose a "between-model move" if k′ ≠ k^(n), and a "within-model move" if k′ = k^(n) and c′ ≠ c^(n). The following result gives a sufficient condition for the chain to be Harris recurrent.
Lemma 1 Let D be the event that no "within-model move" is ever accepted, and let E be the support of ρ̂_t. Then the chain generated by Algorithm 3 satisfies P(D | (k^(0), c^(0)) = (k, c)) = 0, for any k ∈ ⟦1, p⟧ and c ∈ R^{dk} ∩ E.
Lemma 1 states that the chain must eventually accept a "within-model move". It remains true for other choices of q(k (n) , ·) in Algorithm 3, provided that the stationarity ofρ t is preserved.
Theorem 3 Let E denote the support of ρ̂_t. Then, for any c^(0) ∈ E, the chain (c^(n))_{1:N} generated by Algorithm 3 is ρ̂_t-irreducible, aperiodic and Harris recurrent.
Theorem 3 legitimates our approximation PACBO to perform online clustering, since it asymptotically mimics the behavior of the computationally unavailableρ t . To the best of our knowledge, this kind of guarantee is original in the PAC-Bayesian literature.
Finally, let us stress that obtaining an explicit rate of convergence is beyond the scope of the present work. However, in most cases the chain converges rather quickly in practice, as illustrated by Figure 2. At time t, we advocate setting k^(0) to the value k^(N) obtained at round t − 1, as a warm start.

Numerical study
This section is devoted to the illustration of the potential of our quasi-Bayesian approach on synthetic data. Let us stress that all experiments are reproducible, thanks to the PACBO R package (Li, 2016). We do not claim to be exhaustive here but rather show the (good) behavior of our implementation on a toy example.

CALIBRATION OF PARAMETERS AND MIXING PROPERTIES
We set R to be the maximum ℓ_2-norm of the observations. Note that too small a value yields acceptance ratios close to zero and degrades the mixing of the chain. As advised by the theory, we set λ_t = 0.6 × (d + 2)/(2√t). Recall that large values enforce the quasi-posterior to account more for past data, whereas small values make the quasi-posterior closer to the prior. We illustrate in Figure 2 the mixing behavior of PACBO. The convergence occurs quickly, and the default length of the RJMCMC runs is set to 500 in the PACBO package: this was a ceiling value in all our simulations.
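In code, the default calibration above (reading the rate as λ_t = 0.6(d + 2)/(2√t), with R the largest ℓ_2-norm observed so far) may be set as follows; this is our own sketch, not the PACBO package internals:

```python
import numpy as np

def default_parameters(X_past, d, t):
    """Default calibration sketch: R = maximum l2-norm of the observations
    seen so far, lambda_t = 0.6 * (d + 2) / (2 * sqrt(t))."""
    R = float(np.max(np.linalg.norm(np.atleast_2d(X_past), axis=1)))
    lam_t = 0.6 * (d + 2) / (2.0 * np.sqrt(t))
    return R, lam_t
```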

ONLINE CLUSTERING
A large variety of methods have been proposed in the literature for selecting the number k of clusters in batch clustering (see Milligan and Cooper, 1985; Gordon, 1999, for a survey). These methods may be of local or global nature. For local methods, at each step, each cluster is either merged with another one, split into two, or left unchanged. Global methods evaluate the empirical distortion of any clustering as a function of the number k of cells over the whole dataset, and select the minimizer of this distortion. The rule of Hartigan (1975) is a well-known representative of local methods. Popular global methods include the works of Calinski and Harabasz (1974), Krzanowski and Lai (1988) and Kaufman and Rousseeuw (1990), where functions based on the empirical distortion or on the average within-cluster dispersion of each point are constructed, and the optimal number of clusters is the maximizer of these functions. In addition, the Gap Statistic (Tibshirani et al., 2001) compares the change in within-cluster dispersion with the one expected under an appropriate reference null distribution. More recently, CAPUSHE (CAlibrating Penalty Using Slope Heuristics), introduced by Fischer (2011) and Baudry et al. (2012), addresses the problem from the penalized model selection perspective, in the form of two methods: DDSE (Data-Driven Slope Estimation) and Djump (Dimension jump). R packages implementing those methods are used with their default parameters in our simulations.
However let us stress here that none of the aforementioned methods is specifically designed for online clustering. Indeed, to the best of our knowledge PACBO is the sole procedure that explicitly takes advantage of the sequential nature of data. For that reason, we present below the behavior and a comparison of running times between PACBO and the aforementioned methods, on the following synthetic online clustering toy example.
Model: 10 mixed groups in dimension 2. Observations (x_t)_{t=1,...,T=200} are simulated in the following way: define first, for each t ∈ ⟦1, T⟧, a pair (c_{1,t}, c_{2,t}) ∈ R², where c_{1,t} = −(5/2)π + (5π/9)(⌈t/20⌉ − 1) and c_{2,t} = 5 sin(c_{1,t}). Then, for t ∈ ⟦1, 100⟧, x_t is sampled from a uniform distribution on the unit cube in R², centered at (c_{1,t}, c_{2,t}). For t ∈ ⟦101, 200⟧, x_t is generated from a bivariate Gaussian distribution, centered at (c_{1,t}, c_{2,t}) with identity covariance matrix.
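Under our reading of the model above (centers stepping along a sine curve, one new group every 20 rounds; the exact center formula is our reconstruction), the data can be simulated as follows:

```python
import numpy as np

def simulate(T=200, rng=None):
    """Synthetic model sketch: group centers move along a sine curve, one new
    group every 20 steps; uniform noise on the unit cube for t <= 100,
    standard Gaussian noise afterwards."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = np.empty((T, 2))
    for t in range(1, T + 1):
        c1 = -2.5 * np.pi + (5 * np.pi / 9) * ((t - 1) // 20)  # group 0..9
        c2 = 5.0 * np.sin(c1)
        if t <= 100:
            X[t - 1] = [c1, c2] + rng.uniform(-0.5, 0.5, size=2)  # unit cube
        else:
            X[t - 1] = rng.normal([c1, c2], 1.0)                  # Gaussian
    return X
```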
In this online setting, the true number k_t of groups increases by 1 every 20 time steps, to eventually reach 10 (and the maximal number of clusters is set to 20 for all methods). Figure 3a shows the ECL for PACBO and the OCL, along with 95% confidence intervals computed on 100 realizations with T = 200 observations, with λ_t = 0.6 × (d + 2)/(2√t) and R = 15 (so that all observations are in the ℓ_2-ball B_2(0, R)). Jumps in the ECL occur when new clusters of data are observed. Since PACBO outputs a partition based only on the past observations, the instantaneous loss is larger whenever a new cluster appears. However, PACBO quickly identifies the new cluster. This is also supported by Figure 3b, which represents the true and estimated numbers of clusters.
In addition, we also count the number of correct estimations of the true number k_t of clusters. Table 1 contains its mean (and standard deviation, over 100 repetitions) for PACBO and its seven competitors. PACBO has the largest mean by a significant margin, and identifies the correct number of clusters for about 120 observations out of 200. Next, we compare the running times of PACBO and its competitors in the online setting. At each time t = 1, . . . , 200, we measure the running time of each method. Table 2 presents the mean (and standard deviation) over 100 repetitions of the total running times. The superiority of PACBO is a straightforward consequence of the fact that it adapts to the sequential nature of data, whereas all other methods conduct a batch clustering at each time step. For the sake of completeness, Appendix A contains an instance of the performance of all methods in estimating the true number of clusters.

Proofs
This section contains the proofs to all original results claimed in Section 3 and Section 4.

Proof of Corollary 1
Let us first introduce some notation. For any k ∈ ⟦1, p⟧ and R > 0, let C(k, R) = {c ∈ R^{dk} : c_j ∈ B_d(0, R), j = 1, . . . , k}, and let Ξ(k, R) denote the corresponding set of vectors of radii ξ = (ξ_1, . . . , ξ_k). We denote by ρ_k(·; c, ξ) the density consisting in the product of k independent uniform distributions on ℓ_2-balls in R^d, namely

ρ_k(c′; c, ξ) = Π_{j=1}^k [Γ(d/2 + 1)/(π^{d/2} ξ_j^d)] 1{c′_j ∈ B_d(c_j, ξ_j)},

where c ∈ C(k, R), ξ ∈ Ξ(k, R) and B_d(c_j, ξ_j) is an ℓ_2-ball in R^d, centered in c_j with radius ξ_j. In the following, we shorten ρ_k(·; c, ξ) to ρ_k when no confusion can arise. The proof relies on choosing a specific ρ in Proposition 1: for any k ∈ ⟦1, p⟧, c ∈ C(k, R) and ξ ∈ Ξ(k, R), let ρ = ρ_k 1{c ∈ R^{dk}}. Then ρ is a well-defined distribution on C and belongs to P_π(C). By Proposition 1, for any such ρ, we first bound the first term on the right-hand side of (16), and then compute the second term on the right-hand side of (16).

Proof of Theorem 1
The proof builds upon the online variance inequality described in Audibert (2009), holding for any λ > 0, any ρ̂ ∈ P_π(C) and any x ∈ R^d. By (22), applying Jensen's inequality for any 1 ≤ t ≤ T, and by (23), (24) and the duality formula (3), we obtain the claimed bound, which completes the proof.

Proof of Corollary 3
The proof is similar to the proof of Corollary 1; the only difference lies in the fact that (20) is replaced with its counterpart for the time-dependent sequence (λ_t).

Proof of Theorem 2
The proof of the upper bound is straightforward: replacing the loss function ℓ(c, x) by the penalized loss ℓ_α(c, x) = ℓ(c, x) + α|c|, with α = √(log T/T), in the proof of Theorem 1, and then choosing λ = 1/√T and p = T^{1/4}, yields the desired upper bound.
We now proceed to the proof of the lower bound. The trick is to replace the supremum over the (x t ) in V T (s) by an expectation.
We first introduce the event Ω_{s,R} = {(X_1, . . . , X_T) ∈ R^{dT} : |c_{T,R}| = s}, where c_{T,R} is defined as in Assumption H(s). Then, the supremum is lower bounded by an expectation under μ^T ∈ P(R^{dT}), the joint distribution of the i.i.d. sample (X_1, . . . , X_T). Now, we have to choose μ in order to maximize the right-hand side of the resulting inequality. This is the purpose of the following lemmas.
The proof of Lemma 2 is similar to Bartlett et al. (1998, Section III.A, step 3). The next lemma controls the probability of the event {|c_{T,R}| = s}, with a proper choice of ∆² and A in the definition of μ.
Lemma 3 Let s ∈ N*, 2 ≤ s ≤ p, and let μ be defined as in Lemma 2. Then, if we choose A = 2s + 1 and ∆² ≥ 2(s − 1)s log T/T, for any ε > 0 and T > 8s² log(2s²/ε), we have P(|c_{T,R}| ≠ s) ≤ ε.
Proof For any k ∈ ⟦1, p⟧, let c_{T,k} first denote the optimal partition in C(k, R) minimizing the penalized empirical loss on (X_1, . . . , X_T). In addition, denote by c_{μ,k} the partition minimizing the expected penalized loss. One can notice that in fact |c| = k in the two above definitions, for any c ∈ C(k, R) ⊂ R^{dk}. In the next chain of inequalities, the first inequality is induced by the definition of c_{T,R}, and the third inequality is due to a domination holding almost surely. In order to control the probability P(|c_{T,R}| < s), let us first consider the Voronoi partition of R^d induced by the set of points {z_i, z_i + w, i = 1, . . . , s}, and for each i define V_i as the union of the Voronoi cells of z_i and z_i + w. Let N_i denote the number of X_t, t = 1, . . . , T, falling in V_i. Hence (N_1, . . . , N_s) follows a multinomial distribution with parameters (T, q_1, q_2, . . . , q_s), where q_1 = q_2 = · · · = q_s = 1/s. The third inequality in the resulting bound is due to the fact that Σ_{t=1}^T ℓ(c_{T,k}, X_t) ≥ min_{i=1,...,s} N_i (A − 1)²∆² for k < s, and the last inequality holds since the marginal distributions of the N_i (i = 1, . . . , s) are the same binomial distribution with parameters (T, 1/s). Finally, we can bound the last term by Hoeffding's inequality: if s > 2, A = 2s + 1, T > 8s² log(2s²/ε) and ∆² is as in the statement, the claim follows. Next, we proceed to the proof of the lower bound of Theorem 2. First of all, since the (X_1, . . . , X_T) are i.i.d., following the distribution μ, and by the definition of Ω_{s,R}, we can write a chain of inequalities in which ĉ in the first inequality is given by ĉ = arg inf_{c∈C} E_{μ^T} [ℓ(c, X_t) + |c|√(log T/T)] 1(Ω_{s,R}). Note that ĉ does not depend on t, since μ is a symmetric uniform distribution (defined in Lemma 2).
The second inequality is due to Jensen's inequality, and the fourth inequality relies on the fact that, with the definition of c_{T,R} and μ, a lower bound holds almost surely, where ∆ > 0 is related to the choice of μ in Lemma 2 and its value is constrained according to Lemma 3. Then we obtain, for any ε > 0, a lower bound on the expected penalized loss. Moreover, by Jensen's inequality, combining (26) and (27), we obtain a lower bound on V_T(s). Furthermore, by taking ε = 1/T and choosing the minimum value of ∆² allowed in Lemma 3, (28) yields the claimed rate. Finally, we need to ensure that s pairs of points {z_i, z_i + w} can be packed in B_d(R) such that the distance between any two of the z_i is at least 2A∆. A sufficient condition (Kolmogorov and Tikhomirov, 1961) holds: if ∆ ≤ R/6 (which is satisfied if T is large enough), the packing is possible whenever s ≤ (R/(3A∆))^d. As A = 2s + 1 and ∆² < log T/T, we get the desired result.

Proof of Lemma 1
Let D_n denote the event that no "within-model move" is ever accepted in the first n moves. Then D_1 = D_1^{within} ∪ D_1^{between}, where D_1^{within} stands for the event that a "within-model move" is proposed but rejected in one step, and D_1^{between} for the event that a "between-model move" is proposed in one step. Then we have

P(D_1 | (k^(0), c^(0)) = (k, c)) = P(k′ ≠ k | (k, c)) + P(k′ = k, but the move is rejected | (k, c)).

Under the assumption that k′ = k, we have that c′, c ∈ R^{dk}, therefore the restriction of ρ̂_t to R^{dk} is well defined. Moreover, by the definition of π_k in (7), the support of the restriction of ρ̂_t to R^{dk} is R^{dk} ∩ E = (B_d(2R))^k. Hence the function (c′, c) ↦ h_t(c′ | (k, c)) is strictly positive and continuous on the compact set (B_d(2R))^k × (B_d(2R))^k. As a consequence, the minimum of h_t(c′ | (k, c)) on (B_d(2R))^k × (B_d(2R))^k is achieved, and we denote it by m_k. In addition, due to the continuity and positivity of ρ_k on R^{dk}, the acceptance probability of a "within-model move" is, for any k ∈ ⟦1, p⟧, bounded away from zero. Hence, uniformly on k ∈ ⟦1, p⟧ and c ∈ R^{dk} ∩ E, the probability of never accepting a "within-model move" vanishes, which concludes the proof.

Proof of Theorem 3
For any c ∈ E, there exists some k ∈ ⟦1, p⟧ such that c ∈ (B_d(2R))^k ⊂ E. For any k′ ∈ ⟦k − 1, k + 1⟧ and for any A ∈ B(R^{dk′}) such that ρ̂_t(A) > 0, the transition kernel H of the chain splits into an absolutely continuous part and a rejection part, where ρ_{k′}(·, c_{k′}, τ_{k′}) is the multivariate Student distribution in (15), the rejection part involves the probability of rejecting the move when starting at state c, and δ_c(·) is a Dirac measure at c.
The second term on the right-hand side of (37) is handled by Lemma 4, which concludes the proof.
Tuning parameters λ, τ and η can be chosen to obtain a sublinear regret bound for the cumulative loss of Algorithm 1.

Proof
The proof is similar to the proof of Corollary 4; the only difference lies in the fact that (40) is replaced by its counterpart for the multivariate Student prior.

For the sake of completeness, we present in Figure 4 the performance of PACBO and its seven competitors for estimating the true number k_t of clusters along time. We acknowledge that no theoretical guarantee is derived for the estimation of k_t, yet the practical behavior is remarkable.