bcp : An R Package for Performing a Bayesian Analysis of Change Point Problems

Barry and Hartigan (1993) propose a Bayesian analysis for change point problems. We provide a brief summary of selected work on change point problems, both preceding and following Barry and Hartigan. We outline Barry and Hartigan’s approach and oﬀer a new R package, bcp (Erdman and Emerson 2007), implementing their analysis. We discuss two frequentist alternatives to the Bayesian analysis, the recursive circular binary segmentation algorithm (Olshen and Venkatraman 2004) and the dynamic programming algorithm of (Bai and Perron 2003). We illustrate the application of bcp with economic and microarray data from the literature.


Introduction
During the Great Depression, farmers were interested in estimating spatial changes in insect populations threatening crops -a change point problem. Although they discovered that the insect populations didn't change significantly within fields (Finney 1946), change point methods have since been applied to problems in economics, gynecology, and survival analysis, for example. More recently, certain types of microarray data are suitable for change point analysis.
Circular binary segmentation (CBS, Olshen and Venkatraman 2004), a modification of binary segmentation (Sen and Srivastava 1975), is a popular, recursive change point algorithm. Although it was designed for the analysis of microarray data, it is generally applicable to any change point problem. Both procedures assume normality, estimating locations of change points using a likelihood ratio statistic. The R implementation of CBS is available on Bioconductor (Gentleman et al. 2004). Another popular algorithm, breakpoints (BP, Bai and Perron 2003), is available in the R package strucchange (Zeileis et al. 2002(Zeileis et al. , 2003. For a given number of change points, breakpoints() uses least squares regression to estimate the locations of the changes. The function then selects an optimal model (choosing the number of change points) using the Bayesian information criterion (BIC) by default. We implement a Bayesian alternative to these procedures in the R package bcp -the approach of Barry and Hartigan (1993) based on a product partition model. Section 2 summarizes the work of Bai and Perron (2003), Olshen and Venkatraman (2004), and Barry and Hartigan (1993), and describes a numerical challenge posed by the implementation of Barry and Hartigan's procedure (BH). Section 3 describes the bcp package and illustrates its usage in a simulation study and applied to microarray and economic data. We conclude by discussing the strengths and weaknesses of these three procedures and directions for future work.

Methods
This paper offers a new R implementation of the Bayesian change point procedure proposed by Barry and Hartigan (1993). While frequentist procedures for change point analysis estimate specific locations of change points, the Bayesian procedure offers a probability distribution -the probability of a change point at each location in a sequence. To assist the reader, we begin by discussing two frequentist alternatives, of Bai and Perron (2003), and Olshen and Venkatraman (2004). We then describe the Bayesian approach and address a numerical challenge proposed by the implementation of BH. We use the notation of the authors whenever possible.

Bai and Perron's dynamic programming algorithm
Bai and Perron's method uses a dynamic programming algorithm to identify optimal partitions with varying numbers of segments. They provide options for both the minimum segment length, h, and the maximum number of breaks, m. For each possible number of breaks k ≤ m, they obtain the optimal break point locations by minimizing the within-segment sums of squares. The R implementation of BP reports the partition achieving the lowest BIC by default; users may also compute the log likelihood and Akaike information criterion (AIC) of partitions using the logLik() and AIC() functions.
We use Bai and Perron's pure structural change model for simple change point examples and simulations in Section 3.3. Readers interested in the more general framework of Bai and Perron's method (a structural change model with covariates) are encouraged to see Bai and Perron (2003).

Recursive binary and circular binary segmentation
Both binary segmentation and circular binary segmentation consider observations, X 1 , ..., X n , ordered in time or space. Let S i = X 1 + · · · + X i , for 1 ≤ i ≤ n, denote the partial sums of the observations. By assuming the data are assumed normally distributed with a known variance, binary segmentation uses the likelihood ratio statistic for testing the null hypothesis of no change point against the alternative of exactly one change point at an unknown location i (Sen and Srivastava 1975). The likelihood ratio statistic is given by Z B = max|Z i |, where for 1 ≤ i < n. The null hypothesis is rejected if this statistic exceeds the upper α th quantile of the null distribution of Z B , in which case the location of the change point is estimated as that i for which Z B = |Z i | (Olshen and Venkatraman 2004). The test is applied recursively until no changes are detected in any of the segments of the partition.
Because the binary segmentation algorithm is based on a test to detect a single change, it may have trouble detecting a small segment buried in the middle of a large segment (Olshen and Venkatraman 2004). Olshen and Venkatraman proposed the circular binary segmentation algorithm to address this problem. Using CBS, each segment under consideration is connected at the two ends, forming a circle. The likelihood ratio statistic for testing the hypothesis that the arc extending from i + 1 to j and its complement have different means is given by for 1 ≤ i < j ≤ n. This modification of binary segmentation is based on the statistic Z C = max|Z ij |, for 1 ≤ i < j ≤ n, and rejects the null hypothesis if this statistic exceeds an appropriate threshold based on the null distribution of Z C . Note that this procedure allows either a single change (j = n) or two changes (j < n). If the null hypothesis is rejected, the change points are estimated to be i and j such that Z C = |Z ij | and, again, the procedure is applied recursively to the resulting sub-segments until no additional changes are detected. Unlike BP, there is no guarantee that Binary or circular binary segmentation achieves the optimal change point locations for the recommended number of changes.

Barry and Hartigan
As in CBS, Barry and Hartigan assume that the observations are independent N (µ i , σ 2 ), and that the probability of a change point at a position i is p, independently at each i. However, the assumption of independent observations could be weakened "because all that is required is that, given the partition and the parameters, observations in different blocks are mutually independent" (Barry and Hartigan 1993, p. 310). The prior distribution of µ ij (the mean of the block beginning at position i + 1 and ending at position j) is chosen as N (µ 0 , σ 2 0 /(j − i)). This choice of prior allows "weak signals provided that there are sufficient data to estimate them" (Barry and Hartigan 1993, p. 311). Although an exact implementation of Barry and Hartigan's Bayes procedure is possible, the calculations are O(n 3 ); we implement an MCMC approximation that is O(n 2 ). The reader is encouraged to refer to Barry and Hartigan (1993) for the full theoretical framework, and our presentation will conform to their notation.
The algorithm uses a partition ρ = (U 1 , U 2 , ..., U n ), where U i = 1 indicates a change point at position i + 1; we initialize U i to 0 for all i < n, with U n ≡ 1. In each step of the Markov chain, at each position i, a value of U i is drawn from the conditional distribution of U i given the data and the current partition. Following Barry and Hartigan, we let b denote the number of blocks obtained if U i = 0, conditional on U j , for i = j. The transition probability, p, for the conditional probability of a change point at the position i + 1, may be obtained from the simplified ratio presented in Barry and Hartigan: where W 0 , B 0 , W 1 and B 1 are the within and between block sums of squares obtained when U i = 0 and U i = 1 respectively, and X is the data. The tuning parameters γ and λ may take values in [0, 1], chosen so that this method "is effective in situations where there aren't too many changes (γ small), and where the changes that do occur are of a reasonable size (λ small)" (Barry and Hartigan 1993, p. 312). After each iteration, the posterior means are updated conditional on the current partition. A direct implementation of the BH MCMC algorithm is numerically unstable for long sequences because the integrands of either diverge or go to 0 for long sequences. Fortunately, these integrals can be simplified as incomplete beta integrals. The odds of a change point at a particular position in the partition (given the data and the current partition) may be re-expressed as This expression consists of numerically stable terms, allowing application of the BH procedure to sequences of any length. The MCMC implementation of BH estimates the posterior distributions of the change points and the means, µ ij .

Package bcp: Examples and simulations
Package bcp contains the main bcp() function, five methods (summary(), print(), plot(), fitted(), and residuals()), and two datasets. The function bcp() performs the BH analysis, taking six arguments: x: a numerical vector of data.
p0 and w0: optional values for Barry and Haritgan's hyperparameters γ and λ; these default to the value 0.2, which has been found to work well (see Yao (1984) and Barry and Hartigan (1993)).
burnin: optional number of "burn-in" iterations that are excluded from the estimation of the posterior means and probabilities of changes. The chain settles very quickly in practice, and the default is 50.
mcmc: optional number of iterations used in the estimation of the posterior means; the default is 500.
return.mcmc: if TRUE, returns the partition and the associated conditional posterior means for each iteration; the default is set to FALSE and returns a summary of the chain.
After completing the analysis, bcp() returns an object of class "bcp" containing the following components: data: a copy of the data.
mcmc.means: if return.mcmc=TRUE, contains the posterior means conditional on the current partition at the end of every iteration, otherwise mcmc.means is NA.
mcmc.rhos: if return.mcmc=TRUE, contains the partition after each iteration, otherwise mcmc.rhos is NA.
blocks: a vector of the number of blocks after each iteration.
posterior.mean: a vector containing the posterior means.
posterior.var: a vector containing the "naive" posterior variance for each postition, over the mcmc iterations.
posterior.prob: a vector containing the posterior probability of a change point at each position.
Package bcp contains five methods and two datasets:: Methods: -The plot() method provides two plots summarizing the analysis. The first figure, "Posterior Means", displays the data along with the posterior mean of each position. The second figure, "Posterior Probability of a Change", shows the proportion of iterations resulting in a change point at each position.
-The summary() and print() methods, modeled after the the summary() method for "mcmc" objects (Plummer et al. 2007), print a table of the posterior probability of a change in mean, along with the posterior mean and standard deviation for each position. After removing the burnin iterations, the mcmc.means component of a "bcp" object may be converted into an "mcmc" object to view a full "mcmc" summary, and to perform convergence tests. An example of this conversion is given in the bcp package documentation (Erdman and Emerson 2007).
-The fitted() method returns the vector of posterior means.
-The residuals() method returns the data minus the posterior means. Data: -Coriell: two array Comparative Genomic Hybridization (CGH) studies of Coriell cell lines, also appears in the DNAcopy package (Venkatraman and Olshen 2006) that performs CBS, taken originally from Snijders et al. (2001). -RealInt: US interest rate data, considered by Bai and Perron (2003) and Garcia and Perron (1996), also appears in the strucchange package that performs BP (Zeileis et al. 2002(Zeileis et al. , 2003. We consider these examples from the literature in Sections 3.1 and 3.2, and conclude with a simulation study mirroring the one presented in Barry and Hartigan (1993). In all cases, we allow the BP algorithm to consider all possible partitions (the default is a minimum segment length of 15), and we use the hyperparameters recommended by Barry and Hartigan; we do not "tune" any of the parameters for the examples or simulations.

Example: Coriell cell lines
The CGH studies of the Coriell cell lines were used by Venkatraman and Olshen (2006) to demonstrate CBS in DNAcopy, and by Fridlyand et al. (2004) to demonstrate their hidden markov model approach. CGH was developed as a method for detecting and mapping chromosomal aberrations in the genome (Kallioniemi et al. 1992). The procedure begins by dying tumor and normal tissue with different fluorochromes (usually red and green). If there have been no chromosomal aberrations in the tumor DNA, the mixture of the two samples will emit a yellow fluorescence. However, if there have been amplifications or deletions in the tumor sample, the mixture will emit a red or green fluorescence depending on the color with which the normal tissue was dyed. The fluorescence is then translated into DNA copy number. Fridlyand et al. (2004) describes the Coriell copy number data as consisting "of 15 fibroblast cell lines containing cytogenetically mapped partial or whole-chromosome aneuploidy, and each array contained 2276 mapped BACs spotted in triplicate." UCSF SPOT software (Jain et al. 2002) was used to calculate the log 2 ratio ratios of the red-green signal intensities for each spot on the arrays, and to perform local background correction. A commercial program, SPROC, was used to map the data to its location in the genome, and as a filter to remove measurements based on a number of criteria including low reference/signal intensity (Snijders et al. 2001). Finally, the data were edited to remove measurements for which only one of the triplicates remained after the SPROC filter and/or the standard deviation of the triplicates was > 0.2 (Snijders et al. 2001).
When applied to chromosome 11 of Coriell.05296, for example, BP and CBS tend to agree on the location of the changes in copy number (and thus give identical copy number estimates). BH gives similar estimates when there are long blocks of roughly equal copy number, but gives very different estimates when there are shorter blocks and/or "outliers". Figure 1 shows that the BH estimates match the BP and CBS estimates well, except in the middle "block" where BH detects more activity. Short-lived changes, like those in the middle block, can be due to cross-hybridization, false signals (a speck on the array), or a true signal (a deletion perhaps on a single strand of the DNA). Because these clones were spotted in triplicate, a false signal is unlikely. The Bayesian approach clearly acknowledges this area of uncertainty, while the frequentist procedures only identify the two major breaks.

Example: Interest rate time series
The RealInt dataset may be found in the strucchange package (Zeileis et al. 2002(Zeileis et al. , 2003 and was examined in Bai and Perron (2003) and Garcia and Perron (1996). Garcia and Perron use the data "to assess if the ex-ante real interest rate is constant, at least over some long enough periods, or if it exhibits nonstationary behavior" (Garcia and Perron 1996, p. 111). RealInt contains 103 quarterly observations of the US ex-post real interest rate from 1961(1) to 1986(3). Figure 2 shows that BH and BP give similar estimates, detecting changes in 1972(3), 1979(4), 1981(2). Although BH is less sure of the change identified by BP around the fourth quarter of 1982, there is still a notable spike in the posterior probability of this change. CBS also identifies the first interest rate shift at the end of 1972, but splits BH and BP's third block (from 1979(4) -1981(2)), giving a 3-block model.
Garcia and Perron note that "although some of the local optima seem to correspond to important economic events such as the change in the Federal Reserve operating procedures between the end of 1979 and 1982 or the rise in inflation in 1973, the global minimum does not have any ready economic interpretation" (Garcia and Perron 1996, p. 119). The authors use the Markov switching model of Hamilton (1989) and identify mean interest rate shifts in the beginning of 1973 and in mid-1981. Bai and Perron (2003) apply breakpoints() to RealInt with parameters m=5 and h=15 (i.e., they require at least 15 observations per segment, and restrict their search to partitions with 5 or less shifts). They apply a sequential supF test of k vs. k + 1 breaks, and select the model with changes in 1966(4), 1972(3), and 1980 (3). Because Bai and Perron tuned the parameters to suit their interpretation of the problem, their published result differs slightly from the one presented here.

A simulation study
We consider the same 20 "scenes" used by Barry and Hartigan (a scene is a partition together with a set of means for the segments of the partition). For each scene, 60 observations are randomly drawn from one or more normal distributions with specified mean(s) and variance 1. We follow Barry and Hartigan's notation in describing the scenes. For example, the sequence 10 0 15 1 20 2 15 1 denotes a scene with four blocks of lengths 10, 15, 20, and 15 drawn from N (0, 1), N (1, 1), N (2, 1), and N (1, 1) respectively. For each of the 20 scenes, we simulate 1000 data sets, and apply CBS, BP, and BH to each of them. The differences in mean squared error and their standard errors are presented in Table 1. Specifically, for each data set, we calculate the mean squared errors for each method -the mean of the squared differences between the pointwise estimates and the true means. Thus, for each scene, we obtain 1000 MSEs for each method. Lastly, we calculate the mean and standard errors of the differences in these MSEs between methods, using BH as the baseline. Table 1 shows that the BH algorithm outperforms CBS and BP in MSE in most of the scenes. These differences in MSE are especially pronounced when there are shorter blocks and/or outliers. For example, in Figure 3 scene 20, both CBS and BP fail to detect any of the regular, smaller changes, and BH performs significantly better than both methods. Also, in scene 18 with relatively large, short-lived changes, BH performs much better than CBS. In scenes 2 and 7, having two and three blocks respectively, CBS only does slightly better in MSE than BH, and BH does only slightly better than BP. In Figure 3 we see that all three methods do a reasonably good job at estimating the true block means for these scenes.

Conclusion
The current version, 1.8.4, of bcp (which uses R and C) is available for the R system for statistical computing (R Development Core Team    seconds for CBS and 3.62 seconds for BP. Our MCMC of the BH procedure is of O(n 2 ) in speed and O(n) in memory. A future version of bcp may be of O(n) in speed. The analysis of microarray data with, for example, 10,000 measurements currently takes approximately 45 minutes. When allowed to look over all partitions, the least squares calculations of the BP dynamic programming algorithm are of order O(n 3 ) in speed and O(n 2 ) in memory. Thus, the unconstrained BP procedure is not feasible for large sample sizes. The CBS algorithm is of O(n) in both speed and memory, and takes approximately 10 minutes for a sequence of length 10,000.
BH and CBS have a clear speed and memory advantages over BP which is of essential importance with large data sets. For smaller data sets without memory constraints, however, the BP dynamic programming algorithm is guaranteed to find the optimal least squares solution, while the recursive CBS algorithm is not. Furthermore, a C implementation of the dynamic programming algorithm (designed for the analysis of microarray data, but generally applicable) is available in the tilingArray package (Huber and Toedling 2006). For small data sets it is faster than BH, CBS, and the R implementation of BP, but for large data sets, the memory requirements limit its application to problems with frequent change points and blocks of limited length.
Other differences between the procedures are worth noting. Unlike BH and CBS, BP does not assume constant variance. However, a standalone C++ implementation of an extension of BH (allowing for changes in variance) is provided by Loschi and Cruz (2005). A future version of bcp may also allow changes in variance, although simultaneously allowing large numbers of change points seems undesirable for most applications. BP provides the flexibility to specify a minimum segment length and/or the maximum number of breaks, but does not allow for single-observation segments (the algorithm requires at least two observations for the least squares calculations). Thus, the procedure will not correctly adapt to the presence of isolated outliers, or true, single-observation blocks. Although CBS and BH do not allow setting the number of breaks or a minimum segment length, BH provides a hyperparameter, p0, that may be tuned to encourage shorter or longer segments, at the discretion of the user. As demonstrated in the simulations of Section 3.3, CBS performs best when there are longer segments. Finally, we note that both BP and CBS estimate location(s) of the change point(s); this will appeal to many users. However, the BH Bayesian procedure estimates the probability of a change point at each location, providing a more informative summary reflecting the degree of uncertainty in the change points. We expect this feature to be useful in practice.