Posterior Convergence and Model Estimation in Bayesian Change-point Problems

We study the posterior distribution of the Bayesian multiple change-point regression problem when the number and the locations of the change-points are unknown. While it is relatively easy to apply the general theory to obtain the $O(1/\sqrt{n})$ rate up to some logarithmic factor, showing the exact parametric rate of convergence of the posterior distribution requires additional work and assumptions. Additionally, we demonstrate the asymptotic normality of the segment levels under these assumptions. For inferences on the number of change-points, we show that the Bayesian approach can produce a consistent posterior estimate. Finally, we argue that the point-wise posterior convergence property as demonstrated might have bad finite sample performance in that consistent posterior for model selection necessarily implies the maximal squared risk will be asymptotically larger than the optimal $O(1/\sqrt{n})$ rate. This is the Bayesian version of the same phenomenon that has been noted and studied by other authors.


Introduction
We consider the regression problem of estimating a piece-wise constant function when the number of segments as well as the locations of its change-points is unknown. This is an old problem that has attracted much attention recently (Goldenshluger et al., 2006;Ben Hariz et al., 2007;Fearnhead, 2008). Applications of multiple change-point models surged after efficient computations using reversible jump MCMC was discovered (Green, 1995). (Green, 1995) applied piece-wise constant function in the study of the coal mining disaster data in the context of Poisson process. A more recent trend of analysis that dispenses with the usage of MCMC for the change-points problem starts with the paper Liu and Lawrence (1999) where a dynamic programming approach is utilized to marginalize over segment levels and change-point locations. Their original motivation comes from the problem of partitioning DNA sequences into homogeneous segments. This dynamic programming approach is later extended by Fearnhead (2006); Lian (2008).
Unlike the above studies, in this paper we are only concerned with the asymptotic properties of Bayesian multiple change-point problems and investigate from the frequentist view the posterior contraction characteristics of a simplified model. Although a piece-wise constant function involves only a finite number of parameters, as we will only consider the case where an upper bound on the number of change-points is available a priori, it is nevertheless best studied from a infinite-dimensional viewpoint and put the estimation problem in the context of function spaces. Until recently, little is known about the behavior of the posterior distribution of infinite-dimensional models. For consistency issues, Schwartz (1965) shows that the posterior is consistent when certain tests can be established for the true distribution versus the complement of its neighborhood. Barron et al. (1999) further developed the theory by sieve construction and metric entropy bounds. Convergence rates are studied in two independent and to some extent overlapping but complementary works (Ghosal et al. (2000) and Shen and Wasserman (2001)). In particular, Ghosal et al. (2000) extends the idea of constructing suitable tests in order to bound the convergence rates for both nonparametric and parametric problems, and Ghosal and Van Der Vaart (2007) further extends the approach to non-i.i.d. observations. The existence of tests for many specific problems can be found in the existing literature although sometimes new tests need to be carefully designed.
In nonparametric Bayesian analysis, we have an i.i.d. sample Z 1 , . . . , Z n from the distribution P 0 with density p 0 with respect to some measure on the sample space (Z, B). The model space is denoted by P which is known to contain the true distribution P 0 . Given some prior distribution Π on P, the posterior is a random measure given by .
For ease of notation, we will omit the explicit conditioning and write Π n (A) for the posterior distribution. We say that the posterior is consistent if Π n (P ∈ P : d(P, P 0 ) > ǫ) → 0 in P n 0 probability for any ǫ > 0, where d is some suitable distance function between probability measures.
To study rates of convergence, let ǫ n be a sequence decreasing to zero, we say the rate is at least ǫ n if for sufficiently large constant M Π n (P : d(P, P 0 ) > M ǫ n ) → 0 in P n 0 probability.
We also need a slightly weaker definition of rates of convergence by replacing M with a sequence M n and requiring that the above posterior mass converge to zero for any sequence M n that diverges to infinity. This definition is usually required in parametric problems to get rid of the extra log n factor in the convergence rates.
In our regression problem, we observe an i.i.d. sample Z = (Z 1 , . . . , Z n ) with the distribution of Z i = (X i , Y i ) defined structurally by . and θ 0 is a piece-wise constant function on [0, 1) with unknown locations of change-points. We can write θ 0 (t) = k0 j=1 a j I(t j−1 ≤ t < t j ), t 0 = 0 < t 1 < t 2 < . . . < t k0 = 1 using the indicator function and thus θ 0 is parameterized by (a, t), a = (a 1 , . . . , a k0 ), t = (t 0 , . . . , t k0 ). For simplicity, we assume the marginal distributions for {X i } are i.i.d uniform on [0, 1), and note that it is straightforward to extend all the following results to any distribution of X with density bounded away from zero and infinity. Note P 0 is fully determined by θ 0 under these assumptions, and thus we also use the space of piece-wise constant functions as our model space which is equivalent to using P. The measure induced by θ is denoted by P θ , and thus P θ0 is the same as P 0 , the true distribution. Lian (2007a) studied the consistency issue for the above model with the exception that there X i 's are deterministically chosen on a grid. In that paper, consistency is proved for the case that the true regression function is in the Lipschitz class as well. Another related work is Scricciolo (2007) where the Bayesian density estimation problem is studied with density approximated by piece-wise constant functions. Besides the fact that they are interested in density estimation instead of regression, the focus of that paper is very different from the current one. They are mostly concerned with the case of approximating a smooth density using step functions and aim to achieve the optimal rates up to a logarithmic factor. For density functions that are piece-wise constant, they prove parametric rate of convergence also with an extra logarithmic factor. A diverging number of grid points is used and thus this approach cannot be used to estimate the number of segments when the density is truly piece-wise constant.
One simplification of our model compared with those works mentioned at the beginning of this section that focus on the computational issues is that the variance of the noise is assumed to be known here (and actually 1 without loss of generality). Investigations of Bayesian regression with unknown noise levels can run into additional technical difficulties especially in the design of appropriate tests. Consistency of a regression problem with unknown noise level is addressed in Choi and Schervish (2007). We hope to be able to address this problem in our context in a future paper.
In this paper, we focus on the case θ 0 is piece-wise constant and aim to achieve the exact parametric O(1/ √ n) rates of convergence and also study the posterior consistency in the estimation of the number of change-points, which we refer to as the model selection problem. The proofs for the estimation rates involve direct application of general theorems in Ghosal et al. (2000) but the calculation of the covering number is nontrivial in this case. In order to achieve the exact parametric rate, an additional assumption needs to be made to exclude functions with segment lengths that are too short.

Main results
Consider the case where we have some a priori bounds for the number of change-points as well as for the segment levels {a j }. The model space is defined as By convention, we say θ with t k = 1 has k change-points, which is the same as the number of segments.
Another equivalent representation of Θ is where T k is the set of (k + 1)-tuples (t 0 , . . . , t k ) with t j < t j+1 . We will not distinguish between these two different representations and θ can denote either a function or the tuple (a, t). This ambiguity can always be resolved by the context.
For rates of convergence, the distance d we use is the L 2 norm of the function ||θ|| = . Since we only consider uniformly bounded functions, the L 2 norm is equivalent to the Hellinger distance (e.g. Ghosal and Van Der Vaart (2007), section 7.7).
We now specify a prior on Θ using a hierarchical approach. Let Θ k be the subspace of Θ that consists of functions with k change-points, and the prior Π is specified as a mixture with Π k the prior measure on Θ k . We assume that Π k has a density π k (θ) which can be further decomposed as π k (a, t) = π a k (a|t)π t k (t) .
The assumption we make on the prior is that (A) The density π a k (a|t) and π x k (t) are bounded away from zero and infinity on [−K, K] k and T k respectively.
This assumption is satisfied, for example, when t 1 , t 2 , . . . , t k−1 are distributed as the order statistic of k−1 points uniform distributed on [0, 1) while segment levels are independent and uniformly distributed on [−K, K].
The first simple result shows that the posterior rate of convergence is n −1/2 up to a logarithmic factor.

Theorem 1. Under assumption (A), the posterior rate of convergence is at least
Theorem 1 considers the convergence rates of the estimation problem. A different problem is the convergence of the posterior for the number of change-points. Under no additional assumptions, we can show that the posterior probability will concentrate on the true number of change-points with probability converging to 1.

Theorem 2. Under the same assumption (A), we have
Nonparametric Bayesian model selection has been investigated in Ghosal et al. (2008). The focus of that paper is on conditions under which the adaptive rates are achieved when simultaneously considering models with different rates of contraction. Thus it seems the results presented there cannot be directly applied in our case.
To get rid of the extra logarithmic factor in Theorem 1, one would use the more refined Theorem 2.4 in Ghosal et al. (2000), using local covering number instead of the global one. Nevertheless, as shown by Lemma 4 in the appendix, the local covering number for Θ is not bounded as would be required if we set ǫ n = O(1/ √ n). Instead, we consider the smaller model space We can define Θ δ k in a similar way and assumption (A) can be modified accordingly. Specification of a prior on Θ δ is easy. Conceptually, we can just restrict π t k (t) to be supported Θ δ and renormalize the density. Reversible jump algorithms can be easily modified to take into account the constraint. Dynamic programming can also incorporate the pre-specified shortest possible segment length (Lian (2007b). Theorem 1 and Theorem 2 is still true on Θ δ with few modifications on the proofs. Green (1995) also noticed the practical advantage of avoiding short steps. They proposed using even-numbered order statistics from 2k − 1 uniformly distributed points so that short segment lengths are better penalized.
As shown in the appendix, putting some lower bound on the segment lengths makes the local covering number bounded by a constant. This requires a very detailed argument to construct the covering. Using this more refined bound on the covering number, we can achieve the exact parametric rate.
Theorem 3. For any δ > 0, under assumption (A), the posterior rate of convergence on Θ δ is at least Combination of Theorem 2 and 3 immediately gives us the rates of convergence for the change-point locations: Corollary 1. Under the same assumptions as above, the posterior convergence rate for the change-point locations is at least ǫ 2 probability. This rate of course agrees with many frequentist approaches, say using the cumulative sum.
It is well-known that the posterior distribution in regular parametric models conditionally converges to a Gaussian distribution under weak conditions. Since the previous results show that the number and locations of the change-points can be consistently estimated, one would naturally conjecture that the posterior distribution for segment levels will converge to a multivariate Gaussian distribution. This is indeed the case as stated in the following theorem: Theorem 4. Suppose the true segment lengths are l j = t 0 j − t 0 j−1 , j = 1, . . . , k 0 . Denoting the posterior distribution of a = (a 1 , . . . , a k ) restricted on the event k = k 0 (which has a posterior probability converging to 1) by Π n a|Z and the covariance matrix where ||P −Q|| T V is the total variation distance between probability measures P and Q,â(t 0 ) is the maximum likelihood estimator for a, assuming the true locations of the change-points are known.
The above theorems show that the Bayesian procedure possesses very good properties. On the one hand, the exact parametric rate is achieved for the estimation problem in the function space. On the other hand, the number of change-points can be consistently estimated. This is reminiscent of the recent literature on the oracle property of penalized estimators. As shown in Fan and Li (2001), the SCAD estimator, when the smoothing parameter is chosen appropriately, can estimate the zero coefficients in a linear regression model as exactly zero with probability converging to one as sample size increases. At the same time, the estimator is still consistent for nonzero coefficients and the asymptotic distribution is the same whether or not the correct zero positions are known. This is called the oracle property by Fan and Li (2001). More recently, Leeb and Potscher (2008) showed that the oracle property might be misleading in terms of the estimator's finite sample performance and it is impossible to adapt to the unknown zero restrictions without paying a price. The caveat lies in the point-wise nature of the asymptotic theory laid out in Fan and Li (2001). The authors of Leeb and Potscher (2008) show that an unbounded (normalized) risk results for any estimator possessing the sparsity property. Another related work is Yang (2005) where the author shows that AIC has a minimax property which cannot be shared with any model selection consistent estimators in a regression problem.
In our context, similar conclusion can be drawn for the Bayesian multiple change-point problem. Theorem 2 and Theorem exactrate apply to a fixed true piece-wise constant function and thus the convergence as stated is point-wise in nature. It is not difficult to see from the proof of Theorem 3 that the 1/ √ n rate is not actually uniform over the class Θ δ . The reason is that to obtain the bound for the local covering number (Lemma 5 in the appendix), the constants involved does depend on θ 0 . In particular, the derivation of the lemma requires a lower bound on the size of the jumps of the neighboring segments and thus the convergence is not uniform over Θ δ . Intuitively, small jumps makes the estimation more difficult and heavier penalization by the prior must be entertained (possibly by using a prior that depends on the sample size) to achieve model selection consistency at the cost of losing estimation accuracy. As seen in the proof of Theorem 5, the difficulty occurs when the size of the jump is of order O(1/ √ n), in which case it becomes difficult to detect the change-point.
Nevertheless, as discussed above, the convergence is uniform if we further restrict our attention on the sub-class: We state the uniform convergence as a proposition without proof: Proposition 1. For any fixed δ 1 , δ 2 > 0, the rate of convergence is uniformly at least

. The property of model selection consistency is still satisfied in this case.
On the other hand, the following result confirms that we cannot expect the posterior to converge uniformly over the class Θ δ if the method can adapt to the number of changepoints. Note that the theorem applies for any Bayesian posterior distribution for the changepoint problem, not just the specific prior we constructed.
Theorem 5. Suppose the posterior distribution satisfies the model consistency condition: Π n (k = k 0 ) → 1 in P n 0 probability, then the maximal L 2 convergence of θ is necessarily slower than the parametric rate ǫ n = O(1/ √ n). That is, for some M n → ∞, . The above theorem demonstrated the trade-off between function estimation and model selection for our Bayesian multiple change-point problems.

Discussion
In this paper, we investigated in detail some asymptotic properties of Bayesian multiple change-point problems when the noise level is assumed known. We proved estimation rate of convergence as well as model selection consistency of the posterior distribution. The main contribution of the paper is to show that the exact parametric rate is achieved for a restricted class of piece-wise constant functions and that this optimal rate cannot be achieved uniformly over the class.
Our theory still leaves some gaps in between. For example, it is still unknown whether it is absolutely necessary to restrict the functions to have not too short segment lengths in order to achieve the optimal rate. The additional restriction makes the local covering number bounded in order to apply the Theorem in Ghosal et al. (2000). Besides, the situation with unknown error level is of significant practical importance in which case one should also put a prior on the noise level. The convergence property in this case is still an open problem.

Some Lemmas
In preparation for the proofs of the main results, we first collect some lemmas here. Lemma 4 below shows that the local covering number is unbounded as remarked in the main text and is not used further in any other proofs. The constant C is used to denote a generic constant which might not be the same at different places. Note that since we are only considering uniformly bounded class of functions, the Hellinger distance, the Kullback-Leibler divergence, as well as the second moment of the likelihood ratio are all equivalent to the L 2 norm of the regression function. In the following, we set δ 0 = min{min j |t 0 j − t 0 j−1 |, min j |a 0 j − a 0 j−1 |} > 0, which bounds the segment lengths as well as the jump size from below.
Proof. Choose a grid on the domain [0, 1) and another grid on [−K, K] LetΘ = {θ ∈ Θ, θ jumps only at points in ∆ t and takes segment levels in ∆ y }. It is then easy to show thatΘ is an ǫ−covering of Θ with covering number bounded by The next lemma considers the local covering/packing number. In particular, Lemma 4 illustrates why we cannot apply Theorem 2.4 from Ghosal et al. (2000) to obtain the exact parametric rate on Θ.
Proof. Without loss of generality, we consider θ 0 = 0. We construct a lower bound for the packing number. For simplicity, we assume 1/(4ǫ 2 ) is an integer. Using the partition of the interval [0, and construct the piece-wise constant functions θ i (t) = I((i − 1)4ǫ 2 ≤ t < i · 4ǫ 2 ). Obviously, for this set of functions, we have ||θ i || = 2ǫ and ||θ i − θ j || = 2 √ 2ǫ. The lower bound for the covering number is obtained by the simple relationship between covering number and packing number.
for some constant C that depends on δ, δ 0 , K and k max but does not depend on ǫ.
Proof. Suppose that ||θ − θ 0 || ≤ 2ǫ. From the proof of Lemma 2, we know that each changepoint of θ 0 has a corresponding change-point of θ that satisfies |t t(j) − t 0 j | ≤ 16ǫ 2 /δ 2 0 . For any segment level a 0 j of θ 0 , denote the corresponding index of the segment of θ that has an overlap of at least δ/2 by r(j), by similar argument as Lemma 2, |a j − a 0 r(j) | ≤ 2 √ 2ǫ/ √ δ. To construct a covering, we partition [0, 1) into nonoverlapping intervals. In the following, M, B, N are sufficiently large integers to be chosen later. First, each interval [t 0 j − 16ǫ 2 /δ 2 0 , t 0 j + 16ǫ 2 /δ 2 0 ] is partitioned into M subintervals with equal lengths. For the rest of [0, 1) we partition it into segments of lengths between δ/2B and δ/B. Obviously the total number of subintervals does not depend on ǫ. These subintervals falls into two types: (i) the subinterval that contains some change-point of θ 0 ; (ii) the subinterval that is entirely contained in some segment of θ 0 . The function class F that forms a covering is defined as the set of functions which is piece-wise constant with respect to the partition, takes a value of 0 on type (i) subintervals and takes values of the form a 0 j + 2 √ 2ǫ N √ δ i, i = −N, −(N − 1), . . . , N, on type (ii) subintervals if the subinterval is contained in segment j of θ 0 . The size of F is a constant independent of ǫ and we show next that it is indeed a ǫ/2-covering.
On subintervals of type (i), the squared L 2 distance between F and θ restricted on these intervals are at most 16ǫ 2 Mδ 2 k max K 2 . Type (ii) subintervals can further be divided into three types: (iii) it contains a change-point of θ which is closest to some change-point of θ 0 ; (iv) it contains a change-point of θ other than those closest to some change-point of θ 0 ; (v) it is entirely contained in some segment of θ. On subintervals of type (iii) the squared distance is at most 16ǫ 2 Mδ 2 k max K 2 . On subintervals of type (iv) the squared distance is at Thus when M, B, N is large enough, we have a ǫ/2-covering.

Proofs of the main results
Proof of Theorem 1. We apply Theorem 2.1 in Ghosal et al. (2000) with ǫ n = C log n/n. Condition (2.2) for that theorem is verified in Lemma 3, condition (2.3) is trivially satisfied and condition (2.4) is verified in Lemmas 1.
Proof of Theorem 2. Theorem 1 immediately implies that the under-estimation probablity Π n (k < k 0 ) → 0 in P n 0 probability. For over-estimation, it is sufficient to show that is the prior measure on the locations of change-points. For any fixed t ∈ U n , with probability converging to 1, by considering a small neighborhood of the maximum likelihood estimatorâ(t) for the given t as in Laplace approximation, we have p n θ (Z) .
Since this is true for any M , it is also true for some slowly diverging sequence M n as in the statement of the theorem.